Abstract

Stochastic gradient descent (SGD) is a simple and popular method to
solve stochastic optimization problems which arise in machine learning.
For strongly convex problems, its convergence rate was known to be
$O(\log(T)/T)$, by running SGD for $T$ iterations and returning the
average point. However, recent results showed that using a different
algorithm, one can get an optimal $O(1/T)$ rate. This might lead one to
believe that standard SGD is suboptimal, and maybe should even be
replaced as a method of choice. In this paper, we investigate the
optimality of SGD in a stochastic setting. We show that for smooth
problems, the algorithm attains the optimal $O(1/T)$ rate. However, for
non-smooth problems, the convergence rate with averaging might really be
$\Omega(\log(T)/T)$, and this is not just an artifact of the analysis.
On the flip side, we show that a simple modification of the averaging
step suffices to recover the $O(1/T)$ rate, and no other change of the
algorithm is necessary. We also present experimental results which
support our findings, and point out open problems.



1. Introduction 

Stochastic gradient descent (SGD) is one of the simplest and most
popular first-order methods to solve convex learning problems. Given a
convex loss function and a training set of $T$ examples, SGD can be used
to obtain a sequence of $T$ predictors, whose average has a
generalization error which converges (with $T$) to the optimal one in
the class of predictors we consider. The common framework to analyze
such first-order algorithms is via stochastic optimization, where our
goal is to optimize an unknown convex function $F$, given only unbiased
estimates of $F$'s subgradients (see Sec. 2 for a more precise
definition).

An important special case is when $F$ is strongly convex (intuitively,
can be lower bounded by a quadratic function). Such functions arise, for
instance, in Support Vector Machines and other regularized learning
algorithms. For such problems, there is a well-known $O(\log(T)/T)$
convergence guarantee for SGD with averaging. This rate is obtained
using the analysis of the algorithm in the harder setting of online
learning (Hazan et al., 2007), combined with an online-to-batch
conversion (see (Hazan & Kale, 2011) for more details).



Surprisingly, a recent paper by Hazan and Kale (Hazan & Kale, 2011)
showed that in fact, an $O(\log(T)/T)$ rate is not the best that one can
achieve for strongly convex stochastic problems. In particular, an
optimal $O(1/T)$ rate can be obtained using a different algorithm, which
is somewhat similar to SGD but is more complex (although with comparable
computational complexity)¹. A very similar algorithm was also presented
recently by Juditsky and Nesterov (Juditsky & Nesterov, 2010).

¹ Roughly speaking, the algorithm divides the $T$ iterations into
exponentially increasing epochs, and runs stochastic gradient descent
with averaging on each one. The resulting point of each epoch is used as
the starting point of the next epoch. The algorithm returns the
resulting point of the last epoch.

These results left an important gap: namely, whether the true
convergence rate of SGD, possibly with some sort of averaging, might
also be $O(1/T)$, and the known $O(\log(T)/T)$ result is just an
artifact of the analysis. Indeed, the whole motivation of (Hazan & Kale,
2011) was that the standard online analysis is too loose to analyze the
stochastic setting properly. Perhaps a similar looseness applies to the
analysis of SGD as well? This question has immediate practical
relevance: if the new algorithms enjoy a better rate than SGD, it might
indicate they will work better in practice, and that practitioners
should abandon SGD in favor of them.

In this paper, we study the convergence rate of SGD 
for stochastic strongly convex problems, with the fol- 
lowing contributions: 

• First, we extend known results to show that if $F$ is not only
strongly convex, but also smooth (with respect to the optimum), then
SGD with and without averaging achieves the optimal $O(1/T)$
convergence rate.

• We then show that for non-smooth $F$, there are cases where the
convergence rate of SGD with averaging is $\Omega(\log(T)/T)$. In
other words, the $O(\log(T)/T)$ bound for general strongly convex
problems is real, and not just an artifact of the currently-known
analysis.

• However, we show that one can recover the optimal $O(1/T)$
convergence rate (in expectation and in high probability) by a simple
modification of the averaging step: instead of averaging all $T$
points, we only average the last $\alpha T$ points, where
$\alpha \in (0, 1)$ is arbitrary. Thus, to obtain an optimal rate, one
does not need to use an algorithm significantly different than SGD,
such as those discussed earlier.

• We perform an empirical study on both artificial 
and real-world data, which supports our findings. 

Moreover, our rate upper bounds are shown to hold 
in expectation, as well as in high probability (up to a 
log(log(T)) factor). While the focus here is on getting 
the optimal rate in terms of T, we note that our up- 
per bounds are also optimal in terms of other standard 
problem parameters, such as the strong convexity pa- 
rameter and the variance of the stochastic gradients. 



Following the paradigm of (Hazan & Kale, 2011), we analyze the algorithm
directly in the stochastic setting, and avoid an online analysis with an
online-to-batch conversion. This also allows us to prove results which
are more general. In particular, the standard online analysis of SGD
requires the step size of the algorithm at round $t$ to equal
$1/(\lambda t)$, where $\lambda$ is the strong convexity parameter of
$F$. In contrast, our analysis copes with any step size
$c/(\lambda t)$, as long as $c$ is not too small.

In terms of related work, we note that the performance of SGD in a
stochastic setting has been extensively researched in stochastic
approximation theory (see for instance (Kushner & Yin, 2003)). However,
these results are usually obtained under smoothness assumptions, and are
often asymptotic, so we do not get an explicit bound in terms of $T$
which applies to our setting. We also note that a finite-sample analysis
of SGD in the stochastic setting was recently presented in (Bach &
Moulines, 2011). However, the focus there was different than ours, and
the bounds obtained there hold only in expectation rather than in high
probability. More importantly, the analysis was carried out under
stronger smoothness assumptions than our analysis, and to the best of
our understanding, does not apply to general, possibly non-smooth,
strongly convex stochastic optimization problems. For example,
smoothness assumptions may not cover the application of SGD to support
vector machines (as in (Shalev-Shwartz et al., 2011)), since it uses a
non-smooth loss function, and thus the underlying function $F$ we are
trying to stochastically optimize may not be smooth.

2. Preliminaries 

We use bold-face letters to denote vectors. Given some vector
$\mathbf{w}$, we use $w_i$ to denote its $i$-th coordinate. Similarly,
given some indexed vector $\mathbf{w}_t$, we let $w_{t,i}$ denote its
$i$-th coordinate. We let $\mathbf{1}_A$ denote the indicator function
for some event $A$.

We consider the standard setting of convex stochastic optimization,
using first-order methods. Our goal is to minimize a convex function $F$
over some convex domain $\mathcal{W}$ (which is assumed to be a subset
of some Hilbert space). However, we do not know $F$, and the only
information available is through a stochastic gradient oracle, which
given some $\mathbf{w} \in \mathcal{W}$, produces a vector
$\hat{\mathbf{g}}$, whose expectation
$\mathbb{E}[\hat{\mathbf{g}}] = \mathbf{g}$ is a subgradient of $F$ at
$\mathbf{w}$. Using a bounded number $T$ of calls to this oracle, we
wish to find a point $\mathbf{w}_T$ such that $F(\mathbf{w}_T)$ is as
small as possible. In particular, we will assume that $F$ attains a
minimum at some $\mathbf{w}^* \in \mathcal{W}$, and our analysis
provides bounds on $F(\mathbf{w}_T) - F(\mathbf{w}^*)$ either in
expectation or in high probability (the high probability results are
stronger, but require more effort and have slightly worse dependence on
some problem parameters). The application of this framework to learning
is straightforward (see for instance (Shalev-Shwartz et al., 2009)):
given a hypothesis class $\mathcal{W}$ and a set of $T$ i.i.d. examples,
we wish to find a predictor $\mathbf{w}$ whose expected loss
$F(\mathbf{w})$ is close to optimal over $\mathcal{W}$. Since the
examples are chosen i.i.d., the subgradient of the loss function with
respect to any individual example can be shown to be an unbiased
estimate of a subgradient of $F$.

We will focus on an important special case of the problem,
characterized by $F$ being a strongly convex function. Formally, we say
that a function $F$ is $\lambda$-strongly convex, if for all
$\mathbf{w}, \mathbf{w}' \in \mathcal{W}$ and any subgradient
$\mathbf{g}$ of $F$ at $\mathbf{w}$,

$$F(\mathbf{w}') \geq F(\mathbf{w}) + \langle \mathbf{g}, \mathbf{w}' - \mathbf{w} \rangle + \frac{\lambda}{2}\|\mathbf{w}' - \mathbf{w}\|^2. \qquad (1)$$

Another possible property of $F$ we will consider is smoothness, at
least with respect to the optimum $\mathbf{w}^*$. Formally, a function
$F$ is $\mu$-smooth with respect to $\mathbf{w}^*$ if for all
$\mathbf{w} \in \mathcal{W}$,

$$F(\mathbf{w}) - F(\mathbf{w}^*) \leq \frac{\mu}{2}\|\mathbf{w} - \mathbf{w}^*\|^2. \qquad (2)$$

Such functions arise, for instance, in logistic and least-squares
regression, and in general for learning linear predictors where the loss
function has a Lipschitz-continuous gradient.
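
As a quick sanity check of the two definitions (a worked example added for illustration, not part of the original analysis), take $F(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|^2$, whose minimizer is $\mathbf{w}^* = \mathbf{0}$ and whose gradient at $\mathbf{w}$ is $\mathbf{g} = \mathbf{w}$. The derivation assumes that Eq. (1) carries the quadratic term $\frac{\lambda}{2}\|\mathbf{w}' - \mathbf{w}\|^2$ and that the right-hand side of Eq. (2) is $\frac{\mu}{2}\|\mathbf{w} - \mathbf{w}^*\|^2$.

```latex
\begin{align*}
F(\mathbf{w}')
  &= \tfrac12\|\mathbf{w} + (\mathbf{w}'-\mathbf{w})\|^2
   = F(\mathbf{w}) + \langle \mathbf{g}, \mathbf{w}'-\mathbf{w}\rangle
     + \tfrac{1}{2}\|\mathbf{w}'-\mathbf{w}\|^2, \\
F(\mathbf{w}) - F(\mathbf{w}^*)
  &= \tfrac12\|\mathbf{w}-\mathbf{w}^*\|^2 .
\end{align*}
```

Under those assumptions, this $F$ meets Eq. (1) with equality for $\lambda = 1$ and Eq. (2) with equality for $\mu = 1$, so it is both 1-strongly convex and 1-smooth with respect to $\mathbf{w}^*$.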

The algorithm we focus on is stochastic gradient descent (SGD). The SGD
algorithm is parameterized by step sizes $\eta_1, \ldots, \eta_T$, and
is defined as follows:

1. Initialize $\mathbf{w}_1 \in \mathcal{W}$ arbitrarily (or randomly)

2. For $t = 1, \ldots, T$:

• Query the stochastic gradient oracle at $\mathbf{w}_t$ to get a
random $\hat{\mathbf{g}}_t$ such that
$\mathbb{E}[\hat{\mathbf{g}}_t] = \mathbf{g}_t$ is a subgradient of
$F$ at $\mathbf{w}_t$.

• Let $\mathbf{w}_{t+1} = \Pi_{\mathcal{W}}(\mathbf{w}_t - \eta_t \hat{\mathbf{g}}_t)$,
where $\Pi_{\mathcal{W}}$ is the projection operator on $\mathcal{W}$.

This algorithm returns a sequence of points
$\mathbf{w}_1, \ldots, \mathbf{w}_T$. To obtain a single point, one can
use several strategies. Perhaps the simplest one is to return the last
point, $\mathbf{w}_{T+1}$. Another procedure, for which the standard
online analysis of SGD applies (Hazan et al., 2007), is to return the
average point

$$\bar{\mathbf{w}}_T = \frac{1}{T}(\mathbf{w}_1 + \ldots + \mathbf{w}_T).$$
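
The procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the domain is taken to be a box $[lo, hi]^d$ (so that projection is coordinate-wise clipping), the step size is $c/(\lambda t)$, and the names `project_box`, `sgd`, `average` and `noisy_quadratic_oracle` are hypothetical helpers introduced here.

```python
import random

def project_box(w, lo, hi):
    """Euclidean projection onto the box [lo, hi]^d (coordinate-wise clipping)."""
    return [min(max(x, lo), hi) for x in w]

def sgd(oracle, w1, T, lam, c=1.0, lo=-1.0, hi=1.0):
    """Run T rounds of projected SGD with step size eta_t = c / (lam * t).

    Returns the list of iterates [w_1, ..., w_T].
    """
    w = list(w1)
    iterates = []
    for t in range(1, T + 1):
        iterates.append(w)
        g = oracle(w)                      # noisy (sub)gradient at w_t
        eta = c / (lam * t)
        w = project_box([wi - eta * gi for wi, gi in zip(w, g)], lo, hi)
    return iterates

def average(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def noisy_quadratic_oracle(w, rng=random):
    """Unbiased gradient estimate for F(w) = 0.5 * ||w||^2: returns w + z,
    with z uniform on [-1, 1]^d (an illustrative noise model)."""
    return [wi + rng.uniform(-1.0, 1.0) for wi in w]
```

For example, `average(sgd(noisy_quadratic_oracle, [1.0] * 5, T=1000, lam=1.0))` returns the averaged point for a noisy quadratic in five dimensions.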



For stochastic optimization of $\lambda$-strongly convex functions, the
standard analysis (through online learning) focuses on the step size
$\eta_t$ being exactly $1/(\lambda t)$ (Hazan et al., 2007). Our
analysis will consider more general step-sizes $c/(\lambda t)$, where
$c$ is a constant. We note that a step size of $\Theta(1/t)$ is
necessary for the algorithm to obtain an optimal convergence rate (see
Appendix A).

In general, we will assume that regardless of how $\mathbf{w}_1$ is
initialized, it holds that $\mathbb{E}[\|\hat{\mathbf{g}}_t\|^2] \leq G^2$
for some fixed constant $G$. Note that this is a somewhat weaker
assumption than (Hazan & Kale, 2011), which required that
$\|\hat{\mathbf{g}}_t\|^2 \leq G^2$ with probability 1, since we focus
only on bounds which hold in expectation. These types of assumptions are
common in the literature, and are generally implied by taking
$\mathcal{W}$ to be a bounded domain, or alternatively, assuming that
$\mathbf{w}_1$ is initialized not too far from $\mathbf{w}^*$ and $F$
satisfies certain technical conditions (see for instance the proof of
Theorem 1 in (Shalev-Shwartz et al., 2011)).

Full proofs of our results are provided in Appendix B.

3. Smooth Functions

We begin by considering the case where the expected function
$F(\cdot)$ is both strongly convex and smooth with respect to
$\mathbf{w}^*$. Our starting point is to show an $O(1/T)$ bound for the
last point obtained by SGD. This result is well known in the literature
(see for instance (Nemirovski et al., 2009)) and we include a proof for
completeness. Later on, we will show how to extend it to a
high-probability bound.

Theorem 1. Suppose $F$ is $\lambda$-strongly convex and $\mu$-smooth
with respect to $\mathbf{w}^*$ over a convex set $\mathcal{W}$, and
that $\mathbb{E}[\|\hat{\mathbf{g}}_t\|^2] \leq G^2$. Then if we pick
$\eta_t = c/(\lambda t)$ for some constant $c > 1/2$, it holds for any
$T$ that

$$\mathbb{E}[F(\mathbf{w}_T) - F(\mathbf{w}^*)] \leq \frac{1}{2}\max\left\{4, \frac{c}{2 - 1/c}\right\}\frac{\mu G^2}{\lambda^2 T}.$$

The theorem is an immediate corollary of the following key lemma, and
the definition of $\mu$-smoothness with respect to $\mathbf{w}^*$.

Lemma 1. Suppose $F$ is $\lambda$-strongly convex over a convex set
$\mathcal{W}$, and that $\mathbb{E}[\|\hat{\mathbf{g}}_t\|^2] \leq G^2$.
Then if we pick $\eta_t = c/(\lambda t)$ for some constant $c > 1/2$,
it holds for any $T$ that

$$\mathbb{E}[\|\mathbf{w}_T - \mathbf{w}^*\|^2] \leq \max\left\{4, \frac{c}{2 - 1/c}\right\}\frac{G^2}{\lambda^2 T}.$$



We now turn to discuss the behavior of the average point
$\bar{\mathbf{w}}_T = (\mathbf{w}_1 + \ldots + \mathbf{w}_T)/T$, and
show that for smooth $F$, it also enjoys an optimal $O(1/T)$
convergence rate (with even better dependence on $c$).

Theorem 2. Suppose $F$ is $\lambda$-strongly convex and $\mu$-smooth
with respect to $\mathbf{w}^*$ over a convex set $\mathcal{W}$, and
that $\mathbb{E}[\|\hat{\mathbf{g}}_t\|^2] \leq G^2$. Then if we pick
$\eta_t = c/(\lambda t)$ for some constant $c > 1/2$,
$\mathbb{E}[F(\bar{\mathbf{w}}_T) - F(\mathbf{w}^*)]$ is at most



2 max ■ 



4c 



1 



A 2 ' A ' A V 2-1/cfT" 



A rough proof intuition is the following: Lemma 1 implies that the
Euclidean distance of $\mathbf{w}_t$ from $\mathbf{w}^*$ is on the
order of $1/\sqrt{t}$, so the squared distance of $\bar{\mathbf{w}}_T$
from $\mathbf{w}^*$ is on the order of
$\left((1/T)\sum_{t=1}^{T} 1/\sqrt{t}\right)^2 \approx 1/T$, and the
rest follows from smoothness.
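
The order-of-magnitude step in this intuition can be made explicit with a standard integral comparison. The following short derivation is added for illustration and is not part of the original proof:

```latex
\[
\frac{1}{T}\sum_{t=1}^{T}\frac{1}{\sqrt{t}}
\;\le\; \frac{1}{T}\Bigl(1+\int_{1}^{T}\frac{ds}{\sqrt{s}}\Bigr)
\;=\; \frac{2\sqrt{T}-1}{T}
\;\le\; \frac{2}{\sqrt{T}},
\qquad\text{so}\qquad
\Bigl(\frac{1}{T}\sum_{t=1}^{T}\frac{1}{\sqrt{t}}\Bigr)^{2} \le \frac{4}{T}.
\]
```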

4. Non-Smooth Functions 

We now turn to discuss the more general case where the function $F$ may
not be smooth (i.e. there is no constant $\mu$ which satisfies Eq. (2)
uniformly for all $\mathbf{w} \in \mathcal{W}$). In the context of
learning, this may happen when we try to learn a predictor with respect
to a non-smooth loss function, such as the hinge loss.

As discussed earlier, SGD with averaging is known to have a rate of at
most $O(\log(T)/T)$. In the previous section, we saw that for smooth
$F$, the rate is actually $O(1/T)$. Moreover, (Hazan & Kale, 2011)
showed that by using a different algorithm than SGD, one can obtain a
rate of $O(1/T)$ even in the non-smooth case. This might lead us to
believe that an $O(1/T)$ rate for SGD is possible in the non-smooth
case, and that the $O(\log(T)/T)$ analysis is simply not tight.

However, this intuition turns out to be wrong. Below, we show that
there are strongly convex stochastic optimization problems in Euclidean
space, in which the convergence rate of SGD with averaging is lower
bounded by $\Omega(\log(T)/T)$. Thus, the logarithm in the bound is not
merely a shortcoming in the standard online analysis of SGD, but is
really a property of the algorithm.

We begin with the following relatively simple example, which shows the
essence of the idea. Let $F$ be the 1-strongly convex function

$$F(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|^2 + w_1,$$

over the domain $\mathcal{W} = [0, 1]^d$, which has a global minimum at
$\mathbf{0}$. Suppose the stochastic gradient oracle, given a point
$\mathbf{w}_t$, returns the gradient estimate
$\hat{\mathbf{g}}_t = \mathbf{w}_t + (Z_t, 0, \ldots, 0)$, where $Z_t$
is uniformly distributed over $[-1, 3]$. It is easily verified that
$\mathbb{E}[\hat{\mathbf{g}}_t]$ is a subgradient of $F$ at
$\mathbf{w}_t$, and that $\mathbb{E}[\|\hat{\mathbf{g}}_t\|^2] \leq d + 5$,
which is a bounded quantity for fixed $d$.
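
The oracle of this example is easy to simulate. The sketch below is an illustration only; the function names `F`, `oracle` and `mean_oracle` are hypothetical, and the Monte Carlo check is just one informal way to see that the noise shifts the first coordinate by $\mathbb{E}[Z_t] = 1$, which is the derivative of the linear term $w_1$.

```python
import random

def F(w):
    """Objective of the first example: F(w) = 0.5 * ||w||^2 + w_1."""
    return 0.5 * sum(x * x for x in w) + w[0]

def oracle(w, rng=random):
    """Stochastic gradient w + (Z, 0, ..., 0) with Z uniform on [-1, 3]."""
    g = list(w)
    g[0] += rng.uniform(-1.0, 3.0)
    return g

def mean_oracle(w, n_samples, rng=random):
    """Monte Carlo estimate of E[oracle(w)], to compare against w + e_1."""
    total = [0.0] * len(w)
    for _ in range(n_samples):
        g = oracle(w, rng)
        total = [a + b for a, b in zip(total, g)]
    return [a / n_samples for a in total]
```

With enough samples, `mean_oracle(w, n)` should be close to `w` with 1 added to its first coordinate, i.e. the gradient of `F` at `w`.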



The following theorem implies that in this case, the convergence rate
of SGD with averaging has an $\Omega(\log(T)/T)$ lower bound. The
intuition for this is that the global optimum lies at a corner of
$\mathcal{W}$, so SGD "approaches" it only from one direction. As a
result, averaging the points returned by SGD actually hurts us.

Theorem 3. Consider the strongly convex stochastic optimization problem
presented above. If SGD is initialized at any point in $\mathcal{W}$,
and run with $\eta_t = c/t$, then for any $T \geq T_0 + 1$, where
$T_0 = \max\{2, c/2\}$, we have

$$\mathbb{E}[F(\bar{\mathbf{w}}_T) - F(\mathbf{w}^*)] \geq \frac{c}{16T}\sum_{t=T_0}^{T-1}\frac{1}{t}.$$

When $c$ is considered a constant, this lower bound is
$\Omega(\log(T)/T)$.

While the lower bound scales with $c$, we remind the reader that one
must pick $\eta_t = c/t$ with constant $c$ for an optimal convergence
rate in general (see discussion in Sec. 2).

This example is relatively straightforward but not fully satisfying,
since it crucially relies on the fact that $\mathbf{w}^*$ is on the
border of $\mathcal{W}$. In strongly convex problems, $\mathbf{w}^*$
usually lies in the interior of $\mathcal{W}$, so perhaps the
$\Omega(\log(T)/T)$ lower bound does not hold in such cases. Our main
result, presented below, shows that this is not the case, and that even
if $\mathbf{w}^*$ is well inside the interior of $\mathcal{W}$, an
$\Omega(\log(T)/T)$ rate for SGD with averaging can be unavoidable. The
intuition is that we construct a non-smooth $F$, which forces
$\mathbf{w}_t$ to approach the optimum from just one direction,
creating the same effect as in the previous example.

In particular, let $F$ be the 1-strongly convex function

$$F(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|^2 + \begin{cases} w_1 & w_1 \geq 0 \\ -7w_1 & w_1 < 0 \end{cases}$$

over the domain $\mathcal{W} = [-1, 1]^d$, which has a global minimum
at $\mathbf{0}$. Suppose the stochastic gradient oracle, given a point
$\mathbf{w}_t$, returns the gradient estimate

$$\hat{\mathbf{g}}_t = \mathbf{w}_t + \begin{cases} (Z_t, 0, \ldots, 0) & w_{t,1} \geq 0 \\ (-7, 0, \ldots, 0) & w_{t,1} < 0 \end{cases}$$

where $Z_t$ is a random variable uniformly distributed over $[-1, 3]$.
It is easily verified that $\mathbb{E}[\hat{\mathbf{g}}_t]$ is a
subgradient of $F$ at $\mathbf{w}_t$, and that
$\mathbb{E}[\|\hat{\mathbf{g}}_t\|^2] \leq d + 63$, which is a bounded
quantity for fixed $d$.

Theorem 4. Consider the strongly convex stochastic optimization problem
presented above. If SGD is initialized at any point $\mathbf{w}_1$ with
$w_{1,1} \geq 0$, and run with $\eta_t = c/t$, then for any
$T \geq T_0 + 2$, where $T_0 = \max\{2, 6c + 1\}$, we have

T 

E[F(w T )-F(w*)} > ^ E 

t=T„+2 



To 
T ' 
When $c$ is considered a constant, this lower bound is
$\Omega(\log(T)/T)$.

We note that the requirement of $w_{1,1} \geq 0$ is just for
convenience, and the analysis also carries through, with some
second-order factors, if we let $w_{1,1} < 0$.

5. Recovering an $O(1/T)$ Rate for SGD with $\alpha$-Suffix Averaging

In the previous section, we showed that SGD with averaging may have a
rate of $\Omega(\log(T)/T)$ for non-smooth $F$. To get the optimal
$O(1/T)$ rate for any $F$, we might turn to the algorithms of (Hazan &
Kale, 2011) and (Juditsky & Nesterov, 2010). However, these algorithms
constitute a significant departure from standard SGD. In this section,
we show that it is actually possible to get an $O(1/T)$ rate using a
much simpler modification of the algorithm: given the sequence of
points $\mathbf{w}_1, \ldots, \mathbf{w}_T$ provided by SGD, instead of
returning the average
$\bar{\mathbf{w}}_T = (\mathbf{w}_1 + \ldots + \mathbf{w}_T)/T$, we
average and return just a suffix, namely

$$\bar{\mathbf{w}}^{\alpha}_T = \frac{\mathbf{w}_{(1-\alpha)T+1} + \ldots + \mathbf{w}_T}{\alpha T},$$

for some constant $\alpha \in (0, 1)$ (assuming $\alpha T$ and
$(1 - \alpha)T$ are integers). We call this procedure $\alpha$-suffix
averaging.
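
To make the procedure concrete, here is a small Python sketch (an illustration, not the authors' code). It works with scalar iterates for brevity and runs SGD on the one-dimensional version of the first example of Sec. 4, $F(w) = \frac{1}{2}w^2 + w$ over $[0, 1]$. The helper names `suffix_average` and `run_sgd_1d`, the starting point $w_1 = 1$ and the rounding of $\alpha T$ are assumptions made here.

```python
import random

def suffix_average(iterates, alpha):
    """Average the last alpha*T of T scalar iterates (alpha in (0, 1])."""
    T = len(iterates)
    k = max(1, int(round(alpha * T)))      # number of iterates kept
    return sum(iterates[T - k:]) / k

def run_sgd_1d(T, c=1.0, seed=0):
    """Projected SGD on F(w) = 0.5*w^2 + w over [0, 1].

    The oracle returns w + Z with Z uniform on [-1, 3]; the step size is c/t
    (the strong convexity parameter is 1). Returns [w_1, ..., w_T].
    """
    rng = random.Random(seed)
    w = 1.0
    iterates = []
    for t in range(1, T + 1):
        iterates.append(w)
        g = w + rng.uniform(-1.0, 3.0)
        w = min(max(w - (c / t) * g, 0.0), 1.0)   # gradient step, then projection
    return iterates

def F(w):
    """Objective of the one-dimensional example."""
    return 0.5 * w * w + w
```

Calling `suffix_average(ws, 1.0)` recovers the plain average, so the same helper can be used to compare full averaging against, say, `suffix_average(ws, 0.5)` on a single run `ws = run_sgd_1d(20000)`.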

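Theorem 5. Consider SGD with $\alpha$-suffix averaging as described
above, and with step sizes $\eta_t = c/(\lambda t)$ where $c > 1/2$ is
a constant. Suppose $F$ is $\lambda$-strongly convex, and that
$\mathbb{E}[\|\hat{\mathbf{g}}_t\|^2] \leq G^2$ for all $t$. Then for
any $T$, it holds that

$$\mathbb{E}[F(\bar{\mathbf{w}}^{\alpha}_T) - F(\mathbf{w}^*)] \leq \frac{c' + \left(\frac{c}{2} + c'\right)\log\left(\frac{1}{1-\alpha}\right)}{\alpha}\cdot\frac{G^2}{\lambda T},$$

where $c' = \max\left\{\frac{2}{c}, \frac{1}{4 - 2/c}\right\}$.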



Note that for any constant $\alpha \in (0, 1)$, the bound above is
$O(G^2/(\lambda T))$. This applies to any relevant step size
$c/(\lambda t)$, and matches the optimal guarantees in (Hazan & Kale,
2011) up to constant factors. However, this is shown for standard SGD,
as opposed to the more specialized algorithm of (Hazan & Kale, 2011).
Finally, we note that it might be tempting to use Thm. 5 as a guide to
choose the averaging window, by optimizing the bound for $\alpha$ (for
instance, for $c = 1$, the optimum is achieved around
$\alpha \approx 0.65$). However, we note that the optimal value of
$\alpha$ is dependent on the constants in the bound, which may not be
the tightest or most "correct" ones.

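The remark about optimizing the bound over $\alpha$ can be checked numerically. The sketch below assumes the Theorem 5 bound has the form $\frac{c' + (c/2 + c')\log(1/(1-\alpha))}{\alpha}\cdot\frac{G^2}{\lambda T}$ with $c' = \max\{2/c, 1/(4 - 2/c)\}$; this reading is a reconstruction, and the function names are hypothetical. It evaluates only the leading coefficient and minimizes it over a grid.

```python
import math

def suffix_bound_constant(alpha, c):
    """Coefficient of G^2 / (lambda * T) in the assumed Theorem 5 bound (needs c > 1/2)."""
    c_prime = max(2.0 / c, 1.0 / (4.0 - 2.0 / c))
    return (c_prime + (c / 2.0 + c_prime) * math.log(1.0 / (1.0 - alpha))) / alpha

def best_alpha(c, grid_size=999):
    """Grid search over alpha in (0, 1) for the smallest bound coefficient."""
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return min(grid, key=lambda a: suffix_bound_constant(a, c))
```

Under this reading, `best_alpha(1.0)` lands close to the value of roughly 0.65 quoted in the text, which is some evidence that the reconstructed constants are consistent with the remark.
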
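Proof Sketch. The proof combines the analysis of online gradient
descent (Hazan et al., 2007) and Lemma 1. In particular, starting as in
the proof of Lemma 1 and extracting the inner products, we get

$$\sum_{t=(1-\alpha)T+1}^{T}\mathbb{E}[\langle \mathbf{g}_t, \mathbf{w}_t - \mathbf{w}^*\rangle] \leq \sum_{t=(1-\alpha)T+1}^{T}\frac{\eta_t G^2}{2} + \sum_{t=(1-\alpha)T+1}^{T}\left(\frac{\mathbb{E}[\|\mathbf{w}_t - \mathbf{w}^*\|^2]}{2\eta_t} - \frac{\mathbb{E}[\|\mathbf{w}_{t+1} - \mathbf{w}^*\|^2]}{2\eta_t}\right). \qquad (3)$$

Rearranging the r.h.s., and using the convexity of $F$ to relate the
l.h.s. to $\mathbb{E}[F(\bar{\mathbf{w}}^{\alpha}_T) - F(\mathbf{w}^*)]$,
we get a convergence upper bound of

$$\frac{1}{2\alpha T}\left(\frac{\mathbb{E}[\|\mathbf{w}_{(1-\alpha)T+1} - \mathbf{w}^*\|^2]}{\eta_{(1-\alpha)T+1}} + \sum_{t=(1-\alpha)T+1}^{T}\mathbb{E}[\|\mathbf{w}_t - \mathbf{w}^*\|^2]\left(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\right) + G^2\sum_{t=(1-\alpha)T+1}^{T}\eta_t\right).$$

Lemma 1 tells us that with any strongly convex $F$, even non-smooth, we
have $\mathbb{E}[\|\mathbf{w}_t - \mathbf{w}^*\|^2] \leq O(1/t)$.
Plugging this in and performing a few more manipulations, the result
follows. □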

One potential disadvantage of suffix averaging is that if we cannot
store all the iterates $\mathbf{w}_t$ in memory, then we need to know
from which iterate $\alpha T$ to start computing the suffix average (in
contrast, standard averaging can be computed "on-the-fly" without
knowing the stopping time $T$ in advance). However, even if $T$ is not
known, this can be easily addressed in several ways. For example, since
our results are robust to the value of $\alpha$, it is really enough to
guess when we passed some "constant" portion of all iterates.
Alternatively, one can divide the rounds into exponentially increasing
epochs, and maintain the average just of the current epoch. Such an
average would always correspond to a constant-portion suffix of all
iterates.
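
The epoch-based idea can be sketched as follows. This is one possible variant, not something specified in the text: besides the running sum of the current epoch it also keeps the sum of the previous (completed) epoch, an assumption made here so that the averaged window never shrinks to a single iterate right after an epoch boundary. With epoch lengths 1, 2, 4, ..., the window then always covers more than half of the iterates seen so far. The class name `EpochSuffixAverager` is hypothetical.

```python
class EpochSuffixAverager:
    """Streaming suffix average using doubling epochs and O(d) memory.

    Keeps the coordinate-wise sums of the previous completed epoch and of the
    current epoch; `value()` averages over both.
    """

    def __init__(self, dim):
        self.epoch_len = 1               # length of the current epoch
        self.cur_sum = [0.0] * dim       # sum of iterates in the current epoch
        self.cur_n = 0
        self.prev_sum = [0.0] * dim      # sum of iterates in the previous epoch
        self.prev_n = 0

    def update(self, w):
        if self.cur_n == self.epoch_len:         # current epoch is full
            self.prev_sum, self.prev_n = self.cur_sum, self.cur_n
            self.cur_sum, self.cur_n = [0.0] * len(w), 0
            self.epoch_len *= 2                   # next epoch is twice as long
        self.cur_sum = [a + b for a, b in zip(self.cur_sum, w)]
        self.cur_n += 1

    def window(self):
        """Number of most recent iterates covered by the average."""
        return self.prev_n + self.cur_n

    def value(self):
        """Suffix average; call only after at least one update."""
        n = self.window()
        return [(a + b) / n for a, b in zip(self.prev_sum, self.cur_sum)]
```

Calling `update(w_t)` once per round and `value()` at the stopping time gives a suffix average without knowing $T$ in advance.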

6. High-Probability Bounds 

All our previous bounds were on the expected suboptimality
$\mathbb{E}[F(\mathbf{w}) - F(\mathbf{w}^*)]$ of an appropriate
predictor $\mathbf{w}$. We now outline how these results can be
strengthened to bounds on $F(\mathbf{w}) - F(\mathbf{w}^*)$ which hold
with arbitrarily high probability $1 - \delta$, with the bound
depending logarithmically on $\delta$. They are slightly worse than our
in-expectation bounds by having worse dependence on the step size
parameter $c$ and an additional $\log(\log(T))$ factor (interestingly,
a similar factor also appears in the analysis of (Hazan & Kale, 2011),
and we do not know if it is necessary). The key result is the following
strengthening of Lemma 1 under slightly stronger technical conditions.

Lemma 2. Let $\delta \in (0, 1/e)$ and $T \geq 4$. Suppose $F$ is
$\lambda$-strongly convex over a convex set $\mathcal{W}$, and that
$\|\hat{\mathbf{g}}_t\|^2 \leq G^2$ with probability 1. Then if we pick
$\eta_t = c/(\lambda t)$ for some constant $c > 1/2$, such that $2c$ is
a whole number, it holds with probability at least $1 - \delta$ that
for any $t \in \{4c^2 + 4c, \ldots, T - 1, T\}$,

$$\|\mathbf{w}_t - \mathbf{w}^*\|^2 \leq \frac{12c^2G^2}{\lambda^2 t} + 8(121G + 1)G\,\frac{c\log(\log(t)/\delta)}{\lambda t}.$$

We note that the assumptions on $2c$ and $t$ are only for simplifying
the result. To obtain high probability versions of Thm. 1, Thm. 2 and
Thm. 5, we simply need to plug in this lemma in lieu of Lemma 1 in
their proofs. This leads overall to rates of the form
$O(\log(\log(T)/\delta)/T)$ which hold with probability $1 - \delta$.

Figure 1. Results for smooth strongly convex stochastic optimization
problem. The experiment was repeated 10 times, and we report the mean
and standard deviation for each choice of $T$. The X-axis is the
log-number of rounds $\log(T)$, and the Y-axis is
$(F(\mathbf{w}_T) - F(\mathbf{w}^*)) \cdot T$. The scaling by $T$ means
that a roughly constant graph corresponds to a $\Theta(1/T)$ rate,
whereas a linearly increasing graph corresponds to a
$\Theta(\log(T)/T)$ rate.



7. Experiments 

We now turn to empirically study how the algorithms behave, and compare
their behavior to our theoretical findings.

We studied the following four algorithms:

1. SGD-A: Performing SGD and then returning the average point over all
$T$ rounds.

2. SGD-α: Performing SGD with $\alpha$-suffix averaging. We chose
$\alpha = 1/2$, namely, we return the average point over the last $T/2$
rounds.

3. SGD-L: Performing SGD and returning the point obtained in the last
round.

4. Epoch-GD: The optimal algorithm of (Hazan & Kale, 2011) for strongly
convex stochastic optimization.

First, as a simple sanity check, we measured the performance of these
algorithms on a simple, strongly convex stochastic optimization
problem, which is also smooth. We define $\mathcal{W} = [-1, 1]^5$, and
$F(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|^2$. The stochastic gradient
oracle, given a point $\mathbf{w}$, returns the stochastic gradient
$\mathbf{w} + \mathbf{z}$ where $\mathbf{z}$ is uniformly distributed
in $[-1, 1]^5$. Clearly, this is an unbiased estimate of the gradient
of $F$ at $\mathbf{w}$. The initial point $\mathbf{w}_1$ of all 4
algorithms was chosen uniformly at random from $\mathcal{W}$. The
results are presented in Fig. 1, and it is clear that all 4 algorithms
indeed achieve a $\Theta(1/T)$ rate, matching our theoretical analysis
(Thm. 1, Thm. 2 and Thm. 5). The results also seem to indicate that
SGD-A has a somewhat worse performance in terms of leading constants.



Second, as another simple experiment, we measured the performance of
the algorithms on the non-smooth, strongly convex problem described in
the proof of Thm. 4. In particular, we simulated this problem with
$d = 5$, and picked $\mathbf{w}_1$ uniformly at random from
$\mathcal{W}$. The results are presented in Fig. 2. As our theory
indicates, SGD-A seems to have an $O(\log(T)/T)$ convergence rate,
whereas the other 3 algorithms all seem to have the optimal $O(1/T)$
convergence rate. Among these algorithms, the SGD variants SGD-L and
SGD-α seem to perform somewhat better than Epoch-GD. Also, while the
average performance of SGD-L and SGD-α is similar, SGD-α has less
variance. This is reasonable, considering the fact that SGD-α returns
an average of many points, whereas SGD-L returns only the very last
point.

Figure 2. Results for the non-smooth strongly convex stochastic
optimization problem. The experiment was repeated 10 times, and we
report the mean and standard deviation for each choice of $T$. The
X-axis is the log-number of rounds $\log(T)$, and the Y-axis is
$(F(\mathbf{w}_T) - F(\mathbf{w}^*)) \cdot T$. The scaling by $T$ means
that a roughly constant graph corresponds to a $\Theta(1/T)$ rate,
whereas a linearly increasing graph corresponds to a
$\Theta(\log(T)/T)$ rate.

Finally, we performed a set of experiments on real-world data. We used
the same 3 binary classification datasets (CCAT, COV1 and ASTRO-PH)
used by (Shalev-Shwartz et al., 2011) and (Joachims, 2006), to test the
performance of optimization algorithms for Support Vector Machines
using linear kernels. Each of these datasets is composed of a training
set and a test set. Given a training set of instance-label pairs,
$\{\mathbf{x}_i, y_i\}_{i=1}^{m}$, we defined $F$ to be the standard
(non-smooth) objective function of Support Vector Machines, namely

$$F(\mathbf{w}) = \frac{\lambda}{2}\|\mathbf{w}\|^2 + \frac{1}{m}\sum_{i=1}^{m}\max\{0, 1 - y_i\langle \mathbf{x}_i, \mathbf{w}\rangle\}. \qquad (4)$$

Following (Shalev-Shwartz et al., 2011) and (Joachims, 2006), we took
$\lambda = 10^{-4}$ for CCAT, $\lambda = 10^{-6}$ for COV1, and
$\lambda = 5 \times 10^{-5}$ for ASTRO-PH. The stochastic gradient
given $\mathbf{w}_t$ was computed by taking a single randomly drawn
training example $(\mathbf{x}_i, y_i)$, and computing the gradient with
respect to that example, namely

$$\hat{\mathbf{g}}_t = \lambda\mathbf{w}_t - \mathbf{1}_{y_i\langle \mathbf{x}_i, \mathbf{w}_t\rangle \leq 1}\, y_i\mathbf{x}_i.$$
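
For concreteness, the objective and the single-example subgradient can be written as the following Python sketch. It is an illustration on dense lists rather than the code used for the experiments; it assumes the stochastic gradient is $\lambda\mathbf{w}_t$ minus $y_i\mathbf{x}_i$ whenever $y_i\langle \mathbf{x}_i, \mathbf{w}_t\rangle \leq 1$, and the function names are hypothetical.

```python
import random

def svm_objective(w, X, y, lam):
    """F(w) = lam/2 * ||w||^2 + (1/m) * sum_i max(0, 1 - y_i <x_i, w>)."""
    m = len(X)
    reg = 0.5 * lam * sum(v * v for v in w)
    hinge = sum(max(0.0, 1.0 - yi * sum(a * b for a, b in zip(xi, w)))
                for xi, yi in zip(X, y))
    return reg + hinge / m

def svm_stochastic_subgradient(w, X, y, lam, rng=random):
    """Subgradient of the objective restricted to one uniformly drawn example."""
    i = rng.randrange(len(X))
    margin = y[i] * sum(a * b for a, b in zip(X[i], w))
    active = 1.0 if margin <= 1.0 else 0.0          # indicator of y_i <x_i, w> <= 1
    return [lam * wj - active * y[i] * xj for wj, xj in zip(w, X[i])]
```

Averaging `svm_stochastic_subgradient` over the random index gives a subgradient of `svm_objective`, which is the unbiasedness property the stochastic setting requires.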

Each dataset comes with a separate test set, and we also report the
objective function value with respect to that set (as in Eq. (4), this
time with $\{\mathbf{x}_i, y_i\}$ representing the test set examples).
All algorithms were initialized at $\mathbf{w}_1 = \mathbf{0}$, with
$\mathcal{W} = \mathbb{R}^d$ (i.e. no projections were performed - see
the discussion in Sec. 2).

The results of the experiments are presented in Fig. 3, Fig. 4 and
Fig. 5. In all experiments, SGD-A performed the worst. The other 3
algorithms performed rather similarly, with SGD-α being slightly better
on the COV1 dataset, and SGD-L being slightly better on the other 2
datasets.

In summary, our experiments indicate the following: 

• SGD-A, which averages over all $T$ predictors, is worse than the
other approaches. This accords with our theory, as well as the results
reported in (Shalev-Shwartz et al., 2011).

• The Epoch-GD algorithm does have better performance than SGD-A, but
a similar or better performance was obtained using the simpler
approaches of $\alpha$-suffix averaging (SGD-α) or even just returning
the last predictor (SGD-L). The good performance of SGD-α is supported
by our theoretical results, as is the performance of SGD-L in the
strongly convex and smooth case.

• SGD-L also performed rather well (with what seems like a
$\Theta(1/T)$ rate) on the non-smooth problem reported in Fig. 2,
although with a larger variance than SGD-α. Our current theory does
not cover the convergence of the last predictor in non-smooth
problems - see the discussion below.

[Figure 3 plots: legend SGD-A, SGD-α, SGD-L, Epoch-GD; X-axis
$\log_2(T)$; panel title "ASTRO - Test Loss".]

Figure 3. Results for the ASTRO-PH dataset. The left row refers to the
average loss on the training data, and the right row refers to the
average loss on the test data. Each experiment was repeated 10 times,
and we report the mean and standard deviation for each choice of $T$.
The X-axis is the log-number of rounds $\log(T)$, and the Y-axis is the
log of the objective function $\log(F(\mathbf{w}_T))$.

Figure 4. Results for the CCAT dataset. See Fig. 3 caption for details.

Figure 5. Results for the CCAT dataset. See Fig. 3 caption for details.

8. Discussion 

In this paper, we analyzed the behavior of SGD for 
strongly convex stochastic optimization problems. We 
demonstrated that this simple and well-known algo- 
rithm performs optimally whenever the underlying 
function is smooth, but the standard averaging step 
can make it suboptimal for non-smooth problems. 
However, a simple modification of the averaging step 
suffices to recover the optimal rate, and a more sophis- 
ticated algorithm is not necessary. Our experiments 
seem to support this conclusion. 

There are several open issues remaining. In particular, the $O(1/T)$
rate in the non-smooth case still requires some sort of averaging.
However, in our experiments and other studies (e.g. (Shalev-Shwartz et
al., 2011)), returning the last iterate $\mathbf{w}_T$ also seems to
perform quite well. Our current theory does not cover this - at best,
one can use Lemma 1 and Jensen's inequality to argue that the last
iterate has an $O(1/\sqrt{T})$ rate, but the behavior in practice is
clearly much better. Does SGD, without averaging, obtain an $O(1/T)$
rate for general strongly convex problems? Also, a fuller empirical
study is warranted of whether and which averaging scheme is best in
practice.

Acknowledgements: We thank Elad Hazan and 
Satyen Kale for helpful comments on an earlier ver- 
sion of this paper. 
A. Justifying $\eta_t = \Theta(1/t)$ Step-Sizes

In this appendix, we justify our focus on the step-size regime $\eta_t = \Theta(1/t)$, by showing that for other step sizes,
one cannot hope for an optimal convergence rate in general.

Let us begin by considering the scalar, strongly convex function $F(w) = \frac{1}{2}w^2$, in the deterministic case where
$\hat{g}_t = g_t = \nabla F(w_t) = w_t$ with probability 1, and show that $\eta_t$ cannot be smaller than $\Omega(1/t)$. Intuitively, such
small step sizes do not allow the iterates $w_t$ to move towards the optimum sufficiently fast. More formally, starting
from (say) $w_1 = 1$ and using the recursive equality $w_{t+1} = w_t - \eta_t g_t$, we immediately get $w_T = \prod_{t=1}^{T-1}(1 - \eta_t)$.
Thus, if we want to obtain an $O(1/T)$ convergence rate using the iterates returned by the algorithm, we must at
least require that

$$\prod_{t=1}^{T-1}(1 - \eta_t) \leq O(1/T).$$

This is equivalent to requiring that

$$-\sum_{t=1}^{T-1}\log(1 - \eta_t) \geq \Omega(\log(T)).$$

For large enough $t$ and small enough $\eta_t$, $\log(1 - \eta_t) \approx -\eta_t$, and we get that $\sum_{t=1}^{T-1}\eta_t$ must scale at least
logarithmically with $T$. This requires $\eta_t \geq \Omega(1/t)$.
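
The effect of too-small step sizes on this deterministic example can be seen numerically. The sketch below is an illustration only; the helper name `last_iterate` and the two particular schedules are choices made here. It evaluates the product $\prod_{t=1}^{T-1}(1 - \eta_t)$ for a $\Theta(1/t)$ schedule and for a faster-decaying $\Theta(1/t^2)$ schedule. The shifted index $t + 1$ is used only to avoid the degenerate first step $\eta_1 = 1$, which would send the iterate to the optimum immediately.

```python
def last_iterate(T, step):
    """w_T = prod_{t=1}^{T-1} (1 - eta_t) for F(w) = w^2 / 2, w_1 = 1, eta_t = step(t)."""
    w = 1.0
    for t in range(1, T):
        w *= 1.0 - step(t)
    return w

# eta_t = 1/(t+1): the product telescopes to 1/T, so the iterate decays.
w_fast = last_iterate(1000, lambda t: 1.0 / (t + 1))

# eta_t = 1/(t+1)^2: the product stays above 1/2, so the iterate stalls.
w_slow = last_iterate(1000, lambda t: 1.0 / (t + 1) ** 2)
```

With step sizes summing to a finite value, as in the second schedule, the iterate never gets close to the optimum, in line with the requirement that $\sum_t \eta_t$ grow at least logarithmically.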

To show that $\eta_t$ cannot be larger than $O(1/t)$, one can consider the function $F(w) = \frac{1}{2}w^2 + w$ over the domain $\mathcal{W} = [0, 1]$; this is a one-dimensional special case of the example considered in Thm. 3. Intuitively, with an appropriate stochastic gradient model, the random fluctuations in $F(w_{t+1})$ (conditioned on $w_1, \ldots, w_t$) are of order $\eta_t$, so we need $\eta_t = O(1/t)$ to get optimal rates. In particular, in the proof of Thm. 3, we show that for an appropriate stochastic gradient model, $\mathbb{E}[F(w_t)] \ge \mathbb{E}[w_t] \ge \eta_t/16$ (see Eq. ). A similar lower bound can also be shown for the unconstrained setting considered in Thm. 4.
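The $\eta_t/16$ lower bound can be checked numerically for a single step. The sketch below is a minimal Monte Carlo estimate under an assumed noise model: the stochastic gradient is $w + Z$ with $Z$ uniform on $[-1, 3]$ (the distribution named in the proof of Thm. 4; using it for this one-dimensional example is an assumption). The function name `mean_next_iterate` and the chosen values of $w_t$ and $\eta$ are hypothetical.

```python
import random


def mean_next_iterate(w, eta, trials=100_000, seed=0):
    """Monte Carlo estimate of E[w_{t+1} | w_t = w] for one projected SGD step.

    Assumed model: F(w) = w^2 / 2 + w on [0, 1], stochastic gradient w + Z
    with Z uniform on [-1, 3] (so E[Z] = 1 matches the true gradient w + 1).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        z = rng.uniform(-1.0, 3.0)
        step = (1.0 - eta) * w - eta * z          # w - eta * (w + z)
        total += min(1.0, max(0.0, step))         # projection onto [0, 1]
    return total / trials


eta = 0.05
for w in (0.0, 0.01, 0.1):
    estimate = mean_next_iterate(w, eta)
    print(f"w_t = {w:4.2f}: E[w_(t+1)] ~ {estimate:.5f}, eta/16 = {eta / 16:.5f}")
```

Even when the current iterate sits exactly at the optimum $w = 0$, the projection keeps the next iterate non-negative, so its mean stays at order $\eta$.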

B. Proofs 

B.1. Some Technical Results

In this subsection we collect some technical results we will need for the other proofs.

Lemma 3. Let $a > 1$, $b \ge 0$ and $x_1 \in [0, D]$ be arbitrary constants. Let $x_2, x_3, \ldots$ be a non-negative sequence which satisfies

\[ x_{t+1} \le \left(1 - \frac{a}{t}\right) x_t + \frac{b}{t^2}. \]

Then for all $t$,

\[ x_t \le \frac{\max(D, b/(a-1))}{t}. \]

Proof. Let $m = \max(D, b/(a-1))$. The proof is by simple induction. The assertion clearly holds for $t = 1$. Now, suppose that $x_t \le m/t$. Then it suffices to show that

\[ \left(1 - \frac{a}{t}\right)\frac{m}{t} + \frac{b}{t^2} \le \frac{m}{t+1}. \]

This can be simplified to

\[ b(t+1) \le m((a-1)t + a), \]

which clearly holds as $m \ge b/(a-1)$. □
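The recursion of Lemma 3 is easy to iterate numerically. The following sketch takes the recursion with equality, starts from $x_1 = D$, and tracks the largest value of $t \cdot x_t / m$. The constants $a = 1.5$, $b = 1$, $D = 2$ and the function name `lemma3_worst_ratio` are illustrative assumptions; this checks one instance only and is not a substitute for the proof.

```python
def lemma3_worst_ratio(a, b, D, T=100_000):
    """Iterate x_{t+1} = (1 - a/t) x_t + b/t^2 from x_1 = D.

    Returns max_t (t * x_t) / m with m = max(D, b / (a - 1)); the lemma
    predicts that this ratio does not exceed 1.
    """
    m = max(D, b / (a - 1.0))
    x = D
    worst = x / m                      # ratio at t = 1
    for t in range(1, T):
        x = (1.0 - a / t) * x + b / t**2
        worst = max(worst, (t + 1) * x / m)
    return worst


# Illustrative constants (not from the paper): a = 1.5, b = 1, D = 2, so m = 2.
ratio = lemma3_worst_ratio(a=1.5, b=1.0, D=2.0)
print(f"max_t t * x_t / m = {ratio:.6f}")
```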

Lemma 4. Let $b \ge 0$, $c \ge 0$ and $x_1 \in [0, D]$ be arbitrary constants. Let $x_2, x_3, \ldots$ be a non-negative sequence which satisfies

\[ x_{t+1} \le \left(\frac{t}{t+1}\right)^2 x_t + \frac{b}{(t+1)^{3/2}}\sqrt{x_t} + \frac{c}{(t+1)^3}. \]

Then for all $t$,

\[ x_t \le \frac{\max\{D, b\sqrt{2} + \sqrt{c/2}\}}{t}. \]

Proof. Let $m = \max\{D, b\sqrt{2} + \sqrt{c/2}\}$. As in the previous lemma, the proof is by induction. The assertion clearly holds for $t = 1$. Now, suppose that $x_t \le m/t$. Then it suffices to show that

\[ \left(\frac{t}{t+1}\right)^2 \frac{m}{t} + \frac{b}{(t+1)^{3/2}}\sqrt{\frac{m}{t}} + \frac{c}{(t+1)^3} \le \frac{m}{t+1}. \]

This can be simplified to

\[ (t+1)m - b(t+1)\sqrt{\frac{t+1}{t}}\sqrt{m} - c \ge 0. \]

By solving the quadratic inequality, we get that this holds whenever

\[ \sqrt{m} \ge \frac{b}{2}\sqrt{\frac{t+1}{t}} + \frac{1}{2}\sqrt{b^2\,\frac{t+1}{t} + \frac{4c}{t+1}}. \]

To ensure this holds for all $t$, it suffices to verify for the value of $t$ maximizing the right hand side, namely $t = 1$:

\[ \sqrt{m} \ge \frac{b}{2}\sqrt{2} + \frac{1}{2}\sqrt{2b^2 + 2c}, \]

which follows immediately from the definition of $m$. □
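Lemma 5. If $\mathbb{E}[\|\hat{\mathbf{g}}_1\|^2] \le G^2$, then

\[ \mathbb{E}[\|\mathbf{w}_1 - \mathbf{w}^*\|^2] \le \frac{4G^2}{\lambda^2}. \]

Proof. Intuitively, the lemma holds because the strong convexity of $F$ implies that the expected value of $\|\hat{\mathbf{g}}_1\|^2$ must strictly increase as we get farther from $\mathbf{w}^*$. More precisely, strong convexity implies that for any $\mathbf{w}_1$,

\[ \langle \mathbf{g}_1, \mathbf{w}_1 - \mathbf{w}^* \rangle \ge \frac{\lambda}{2}\|\mathbf{w}_1 - \mathbf{w}^*\|^2, \]

so by the Cauchy-Schwarz inequality,

\[ \|\mathbf{g}_1\|^2 \ge \frac{\lambda^2}{4}\|\mathbf{w}_1 - \mathbf{w}^*\|^2. \qquad (5) \]

Also, we have that whether $\mathbf{g}_1$ is random or not (depending on whether $\mathbf{w}_1$ is chosen arbitrarily or randomly),

\[ \mathbb{E}[\|\hat{\mathbf{g}}_1\|^2] = \mathbb{E}[\|\mathbf{g}_1 + (\hat{\mathbf{g}}_1 - \mathbf{g}_1)\|^2] = \mathbb{E}[\|\mathbf{g}_1\|^2] + \mathbb{E}[\|\hat{\mathbf{g}}_1 - \mathbf{g}_1\|^2] + 2\mathbb{E}[\langle \hat{\mathbf{g}}_1 - \mathbf{g}_1, \mathbf{g}_1 \rangle] \ge \mathbb{E}[\|\mathbf{g}_1\|^2]. \]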
Combining this and Eq. we get that for all t, 

\[ \mathbb{E}[\|\mathbf{w}_1 - \mathbf{w}^*\|^2] \le \frac{4}{\lambda^2}\mathbb{E}[\|\hat{\mathbf{g}}_1\|^2] \le \frac{4G^2}{\lambda^2}. \]

□



The following version of Freedman's inequality appears in (De La Pena 1999) (Theorem 1.2A): 

Theorem 6. Let $d_1, \ldots, d_T$ be a martingale difference sequence with a uniform upper bound $b$ on the steps $d_i$. Let $V_s$ denote the sum of conditional variances,

\[ V_s = \sum_{i=1}^{s} \mathrm{Var}(d_i \mid d_1, \ldots, d_{i-1}). \]

Then, for every $a, v > 0$,

\[ \mathrm{Prob}\left(\sum_{i=1}^{s} d_i \ge a \text{ and } V_s \le v \text{ for some } s \le T\right) \le \exp\left(\frac{-a^2}{2(v + ba)}\right). \]

The proof of the following lemma is taken almost verbatim from (Bartlett et al. 2008), with the only modification being the use of Theorem 6 to avoid an unnecessary union bound.
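Lemma 6. Let $d_1, \ldots, d_T$ be a martingale difference sequence with a uniform bound $|d_i| \le b$. Let $V_s = \sum_{t=1}^{s} \mathrm{Var}_{t-1}(d_t)$ be the sum of conditional variances of the $d_t$'s. Further, let $\sigma_s = \sqrt{V_s}$. Then we have, for any $\delta < 1/e$ and $T \ge 4$,

\[ \mathrm{Prob}\left(\sum_{t=1}^{s} d_t > 2\max\left\{2\sigma_s, b\sqrt{\ln(1/\delta)}\right\}\sqrt{\ln(1/\delta)} \text{ for some } s \le T\right) \le \log(T)\,\delta. \qquad (6) \]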



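Proof. Note that a crude upper bound on $\mathrm{Var}_{t-1}(d_t)$ is $b^2$. Thus, $\sigma_s \le b\sqrt{T}$. We choose a discretization $0 = \alpha_{-1} < \alpha_0 < \ldots < \alpha_l$ such that $\alpha_{i+1} = 2\alpha_i$ for $i \ge 0$ and $\alpha_l \ge b\sqrt{T}$. We will specify the choice of $\alpha_0$ shortly. We then have,

\[
\begin{aligned}
&\mathrm{Prob}\left(\sum_{t=1}^{s} d_t > 2\max\{2\sigma_s, \alpha_0\}\sqrt{\ln(1/\delta)} \text{ for some } s \le T\right) \\
&\quad = \sum_{j=0}^{l} \mathrm{Prob}\left(\sum_{t=1}^{s} d_t > 2\max\{2\sigma_s, \alpha_0\}\sqrt{\ln(1/\delta)} \ \&\ \alpha_{j-1} < \sigma_s \le \alpha_j \text{ for some } s \le T\right) \\
&\quad \le \sum_{j=0}^{l} \mathrm{Prob}\left(\sum_{t=1}^{s} d_t > 2\alpha_j\sqrt{\ln(1/\delta)} \ \&\ \alpha_{j-1}^2 < V_s \le \alpha_j^2 \text{ for some } s \le T\right) \\
&\quad \le \sum_{j=0}^{l} \mathrm{Prob}\left(\sum_{t=1}^{s} d_t > 2\alpha_j\sqrt{\ln(1/\delta)} \ \&\ V_s \le \alpha_j^2 \text{ for some } s \le T\right) \\
&\quad \le \sum_{j=0}^{l} \exp\left(\frac{-4\alpha_j^2\ln(1/\delta)}{2\alpha_j^2 + \frac{2}{3}\left(2\alpha_j\sqrt{\ln(1/\delta)}\right)b}\right)
 = \sum_{j=0}^{l} \exp\left(\frac{-2\alpha_j\ln(1/\delta)}{\alpha_j + \frac{2}{3}\sqrt{\ln(1/\delta)}\,b}\right),
\end{aligned}
\]

where the last inequality follows from Theorem 6. If we now choose $\alpha_0 = b\sqrt{\ln(1/\delta)}$, then $\alpha_j \ge b\sqrt{\ln(1/\delta)}$ for all $j$. Hence every term in the above summation is bounded by $\exp\left(\frac{-2\ln(1/\delta)}{1 + 2/3}\right) \le \delta$. Choosing $l = \log(\sqrt{T})$ ensures that $\alpha_l \ge b\sqrt{T}$. Thus we have

\[
\begin{aligned}
&\mathrm{Prob}\left(\sum_{t=1}^{s} d_t > 2\max\left\{2\sigma_s, b\sqrt{\ln(1/\delta)}\right\}\sqrt{\ln(1/\delta)} \text{ for some } s \le T\right) \\
&\quad = \mathrm{Prob}\left(\sum_{t=1}^{s} d_t > 2\max\{2\sigma_s, \alpha_0\}\sqrt{\ln(1/\delta)} \text{ for some } s \le T\right) \\
&\quad \le (l+1)\delta = (\log(\sqrt{T}) + 1)\delta \le \log(T)\,\delta.
\end{aligned}
\]

□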



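B.2. Proof of Lemma 1

By the strong convexity of $F$ and the fact that $\mathbf{w}^*$ minimizes $F$ in $\mathcal{W}$, we have

\[ \langle \mathbf{g}_t, \mathbf{w}_t - \mathbf{w}^* \rangle \ge F(\mathbf{w}_t) - F(\mathbf{w}^*) + \frac{\lambda}{2}\|\mathbf{w}_t - \mathbf{w}^*\|^2, \]

as well as

\[ F(\mathbf{w}_t) - F(\mathbf{w}^*) \ge \frac{\lambda}{2}\|\mathbf{w}_t - \mathbf{w}^*\|^2. \]

Also, by convexity of $\mathcal{W}$, for any point $\mathbf{v}$ and any $\mathbf{w} \in \mathcal{W}$ we have $\|\Pi_{\mathcal{W}}(\mathbf{v}) - \mathbf{w}\| \le \|\mathbf{v} - \mathbf{w}\|$. Using these inequalities, we have the following:

\[
\begin{aligned}
\mathbb{E}[\|\mathbf{w}_{t+1} - \mathbf{w}^*\|^2] &= \mathbb{E}[\|\Pi_{\mathcal{W}}(\mathbf{w}_t - \eta_t\hat{\mathbf{g}}_t) - \mathbf{w}^*\|^2] \\
&\le \mathbb{E}[\|\mathbf{w}_t - \eta_t\hat{\mathbf{g}}_t - \mathbf{w}^*\|^2] \\
&= \mathbb{E}[\|\mathbf{w}_t - \mathbf{w}^*\|^2] - 2\eta_t\mathbb{E}[\langle \hat{\mathbf{g}}_t, \mathbf{w}_t - \mathbf{w}^* \rangle] + \eta_t^2\mathbb{E}[\|\hat{\mathbf{g}}_t\|^2] \\
&= \mathbb{E}[\|\mathbf{w}_t - \mathbf{w}^*\|^2] - 2\eta_t\mathbb{E}[\langle \mathbf{g}_t, \mathbf{w}_t - \mathbf{w}^* \rangle] + \eta_t^2\mathbb{E}[\|\hat{\mathbf{g}}_t\|^2] \\
&\le \mathbb{E}[\|\mathbf{w}_t - \mathbf{w}^*\|^2] - 2\eta_t\mathbb{E}\left[F(\mathbf{w}_t) - F(\mathbf{w}^*) + \frac{\lambda}{2}\|\mathbf{w}_t - \mathbf{w}^*\|^2\right] + \eta_t^2 G^2 \\
&\le \mathbb{E}[\|\mathbf{w}_t - \mathbf{w}^*\|^2] - 2\eta_t\mathbb{E}\left[\frac{\lambda}{2}\|\mathbf{w}_t - \mathbf{w}^*\|^2 + \frac{\lambda}{2}\|\mathbf{w}_t - \mathbf{w}^*\|^2\right] + \eta_t^2 G^2 \\
&= (1 - 2\eta_t\lambda)\,\mathbb{E}[\|\mathbf{w}_t - \mathbf{w}^*\|^2] + \eta_t^2 G^2.
\end{aligned}
\]

Invoking Lemma 3, using the fact that $\eta_t = c/\lambda t$ and $\mathbb{E}[\|\mathbf{w}_1 - \mathbf{w}^*\|^2] \le 4G^2/\lambda^2$ (by Lemma 5), we get that

\[ \mathbb{E}[\|\mathbf{w}_T - \mathbf{w}^*\|^2] \le \frac{G^2}{\lambda^2 T}\max\left\{4, \frac{c}{2 - 1/c}\right\}. \]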
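To see the bound of Lemma 1 in action, the sketch below runs projected SGD with $\eta_t = c/\lambda t$ on a toy problem and compares the squared error of the last iterate with the right-hand side above. Everything about the toy problem is an assumption made for illustration: $F(w) = \frac{\lambda}{2}w^2$ on $[-1, 1]$, stochastic gradient $\lambda w \pm 1$ with equal probability (so $\mathbb{E}[\hat{g}^2] \le 2 = G^2$ when $\lambda = 1$), and the helper name `sgd_last_iterate_sq_error`.

```python
import random


def sgd_last_iterate_sq_error(T, c=1.0, lam=1.0, trials=2000, seed=0):
    """Monte Carlo estimate of E[(w_T - w*)^2] with w* = 0.

    Toy problem (an assumption for illustration): F(w) = lam * w^2 / 2 on
    [-1, 1], stochastic gradient lam * w + xi with xi uniform on {-1, +1}.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        w = 1.0                                            # w_1
        for t in range(1, T):
            g_hat = lam * w + rng.choice((-1.0, 1.0))
            w = min(1.0, max(-1.0, w - (c / (lam * t)) * g_hat))
        total += w * w
    return total / trials


T, c, lam, G2 = 200, 1.0, 1.0, 2.0
empirical = sgd_last_iterate_sq_error(T, c, lam)
bound = G2 / (lam**2 * T) * max(4.0, c / (2.0 - 1.0 / c))
print(f"empirical E[(w_T - w*)^2] ~ {empirical:.5f}, Lemma 1 bound = {bound:.5f}")
```

The bound is not tight for this toy problem; the sketch only illustrates the $1/T$ scaling and the constants involved.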



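B.3. Proof of Thm. 2

For any $t$, define $\bar{\mathbf{w}}_t = (\mathbf{w}_1 + \ldots + \mathbf{w}_t)/t$. Then we have

\[
\begin{aligned}
\mathbb{E}[\|\bar{\mathbf{w}}_{t+1} - \mathbf{w}^*\|^2] &= \mathbb{E}\left[\left\|\frac{t}{t+1}\bar{\mathbf{w}}_t + \frac{1}{t+1}\mathbf{w}_{t+1} - \mathbf{w}^*\right\|^2\right] \\
&= \mathbb{E}\left[\left\|\frac{t}{t+1}(\bar{\mathbf{w}}_t - \mathbf{w}^*) + \frac{1}{t+1}(\mathbf{w}_{t+1} - \mathbf{w}^*)\right\|^2\right] \\
&= \left(\frac{t}{t+1}\right)^2\mathbb{E}[\|\bar{\mathbf{w}}_t - \mathbf{w}^*\|^2] + \frac{2t}{(t+1)^2}\mathbb{E}[\langle \bar{\mathbf{w}}_t - \mathbf{w}^*, \mathbf{w}_{t+1} - \mathbf{w}^* \rangle] + \frac{1}{(t+1)^2}\mathbb{E}[\|\mathbf{w}_{t+1} - \mathbf{w}^*\|^2] \\
&\le \left(\frac{t}{t+1}\right)^2\mathbb{E}[\|\bar{\mathbf{w}}_t - \mathbf{w}^*\|^2] + \frac{2t}{(t+1)^2}\mathbb{E}[\|\bar{\mathbf{w}}_t - \mathbf{w}^*\|\,\|\mathbf{w}_{t+1} - \mathbf{w}^*\|] + \frac{1}{(t+1)^2}\mathbb{E}[\|\mathbf{w}_{t+1} - \mathbf{w}^*\|^2].
\end{aligned}
\]

Using the inequality $\mathbb{E}[|XY|] \le \sqrt{\mathbb{E}[X^2]}\sqrt{\mathbb{E}[Y^2]}$ for any random variables $X, Y$ (which follows from Cauchy-Schwarz), and the bound of Lemma 1, we get that $\mathbb{E}[\|\bar{\mathbf{w}}_{t+1} - \mathbf{w}^*\|^2]$ is at most

\[ \left(\frac{t}{t+1}\right)^2\mathbb{E}[\|\bar{\mathbf{w}}_t - \mathbf{w}^*\|^2] + \frac{2\max\{2, \sqrt{c/(2-1/c)}\}\,G}{\lambda(t+1)^{3/2}}\sqrt{\mathbb{E}[\|\bar{\mathbf{w}}_t - \mathbf{w}^*\|^2]} + \frac{\max\{4, c/(2-1/c)\}\,G^2}{\lambda^2(t+1)^3}. \]

Using Lemma 4 and using the fact that $\mathbb{E}[\|\mathbf{w}_1 - \mathbf{w}^*\|^2] \le 4G^2/\lambda^2$ by Lemma 5, we get that

\[ \mathbb{E}[\|\bar{\mathbf{w}}_T - \mathbf{w}^*\|^2] \le \frac{1}{T}\max\left\{\frac{4G^2}{\lambda^2}, \frac{5\max\{2, \sqrt{c/(2-1/c)}\}\,G}{\sqrt{2}\,\lambda}\right\}. \]

By the assumed smoothness of $F$ with respect to $\mathbf{w}^*$, we have $F(\bar{\mathbf{w}}_T) - F(\mathbf{w}^*) \le \frac{\mu}{2}\|\bar{\mathbf{w}}_T - \mathbf{w}^*\|^2$. Combining it with the inequality above, and slightly upper bounding the constants for readability, the result follows.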

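B.4. Proof of Thm. 3

The SGD iterate can be written separately for the first coordinate as

\[ w_{t+1,1} = \Pi_{[0,1]}\left((1 - \eta_t)w_{t,1} - \eta_t Z_t\right). \qquad (7) \]

Fix some $t \ge T_0$, and suppose first that $Z_t \le -1/2$. Conditioned on this event, we have

\[ w_{t+1,1} \ge \Pi_{[0,1]}\left((1 - \eta_t)w_{t,1} + \tfrac{1}{2}\eta_t\right) \ge \Pi_{[0,1]}(\eta_t/2) \ge \eta_t/2, \]

since $t \ge T_0$ implies $\eta_t = c/t \le 2$. On the other hand, if $Z_t > -1/2$, we are still guaranteed that $w_{t+1,1} \ge 0$ by the domain constraints. Using these results, we get

\[
\begin{aligned}
\mathbb{E}[w_{t+1,1}] &= \Pr(Z_t \le -1/2)\,\mathbb{E}[w_{t+1,1} \mid Z_t \le -1/2] + \Pr(Z_t > -1/2)\,\mathbb{E}[w_{t+1,1} \mid Z_t > -1/2] \\
&\ge \Pr(Z_t \le -1/2)\,\mathbb{E}[w_{t+1,1} \mid Z_t \le -1/2] \\
&\ge \Pr(Z_t \le -1/2)\,\frac{\eta_t}{2} = \frac{1}{16}\eta_t. \qquad (8)
\end{aligned}
\]

Therefore,

\[ \mathbb{E}[\bar{w}_{T,1}] \ge \frac{1}{T}\sum_{t=T_0+1}^{T}\mathbb{E}[w_{t,1}] \ge \frac{1}{16T}\sum_{t=T_0+1}^{T}\eta_{t-1} = \frac{1}{16T}\sum_{t=T_0}^{T-1}\eta_t. \]

Thus, by definition of $F$,

\[ \mathbb{E}[F(\bar{\mathbf{w}}_T) - F(\mathbf{w}^*)] = \mathbb{E}[F(\bar{\mathbf{w}}_T)] \ge \mathbb{E}[F((\bar{w}_{T,1}, 0, \ldots, 0))] \ge \mathbb{E}[\bar{w}_{T,1}] \ge \frac{1}{16T}\sum_{t=T_0}^{T-1}\eta_t. \]

Substituting $\eta_t = c/t$ gives the required result.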
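The lower bound on the averaged first coordinate can be compared against a simulation of the recursion in Eq. (7). The sketch below is a minimal Monte Carlo estimate under stated assumptions: $Z_t$ is drawn uniformly from $[-1, 3]$ (the distribution named in the proof of Thm. 4), $c = 1$, $w_{1,1} = 0$, and $T_0 = 2$ is used as a stand-in because the theorem's exact $T_0$ is not restated in this appendix. The function name `average_first_coordinate` is hypothetical.

```python
import random


def average_first_coordinate(T, c=1.0, trials=300, seed=0):
    """Estimate E[bar{w}_{T,1}] for the recursion in Eq. (7) with eta_t = c/t."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        w, running_sum = 0.0, 0.0                 # w holds w_{t,1}, starting at w_{1,1} = 0
        for t in range(1, T + 1):
            running_sum += w                      # accumulate w_{t,1}
            z = rng.uniform(-1.0, 3.0)            # assumed noise model, E[z] = 1
            eta = c / t
            w = min(1.0, max(0.0, (1.0 - eta) * w - eta * z))
        total += running_sum / T
    return total / trials


T, c, T0 = 2000, 1.0, 2
estimate = average_first_coordinate(T, c)
lower_bound = sum(c / t for t in range(T0, T)) / (16 * T)
print(f"estimated E[avg w_1] ~ {estimate:.2e}, (1/16T) * sum eta_t = {lower_bound:.2e}")
```

Both quantities scale like $\log(T)/T$, which is the behavior the theorem asserts for plain averaging.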

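B.5. Proof of Thm. 4

The SGD iterate for the first coordinate is

\[
w_{t+1,1} = \begin{cases}
\Pi_{[-1,1]}\left(\left(1 - \frac{c}{t}\right)w_{t,1} - \frac{c}{t}Z_t\right) & w_{t,1} \ge 0 \\
\Pi_{[-1,1]}\left(\left(1 - \frac{c}{t}\right)w_{t,1} + \frac{7c}{t}\right) & w_{t,1} < 0.
\end{cases} \qquad (9)
\]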

The intuition of the proof is that whenever $w_{t,1}$ becomes negative, then the large gradient of $F$ causes $w_{t+1,1}$ to always be significantly larger than 0. This means that in some sense, $w_{t,1}$ is "constrained" to be larger than 0, mimicking the actual constraint in the example of Thm. 3 and forcing the same kind of behavior, with a resulting $\Omega(\log(T)/T)$ rate.

To make this intuition rigorous, we begin with the following lemma, which shows that $w_{t,1}$ can never be significantly smaller than 0, or "stay" below 0 for more than one iteration.

Lemma 7. For any $t \ge T_0 = \max\{2, 6c + 1\}$, it holds that

\[ w_{t,1} \ge -\frac{c}{t-1}, \]

and if $w_{t,1} < 0$, then

\[ w_{t+1,1} \ge \frac{5c}{t}. \]

Proof. Suppose first that $w_{t-1,1} \ge 0$. Then by Eq. (9) and the fact that $Z_t \ge -1$, we get $w_{t,1} \ge -c/(t-1)$. Moreover, in that case, if $w_{t,1} < 0$, then by Eq. (9) and the previous observation,

\[ w_{t+1,1} = \Pi_{[-1,1]}\left(\left(1 - \frac{c}{t}\right)w_{t,1} + \frac{7c}{t}\right) \ge \Pi_{[-1,1]}\left(-\left(1 - \frac{c}{t}\right)\frac{c}{t-1} + \frac{7c}{t}\right). \]

Since $t > c$, we have $1 - c/t \in (0, 1)$, which implies that the above is lower bounded by

\[ \Pi_{[-1,1]}\left(-\frac{c}{t-1} + \frac{7c}{t}\right). \]

Moreover, since $t > 6c$, we have $-c/(t-1) + 7c/t \in [-1, 1]$, so the projection operator is unnecessary, and we get overall that

\[ w_{t+1,1} \ge -\frac{c}{t-1} + \frac{7c}{t} \ge \frac{5c}{t}. \]

This result was shown to hold assuming that $w_{t-1,1} \ge 0$. If $w_{t-1,1} < 0$, then repeating the argument above for $t - 1$ instead of $t$, we must have $w_{t,1} \ge 5c/(t-1) \ge -c/(t-1)$, so the statement in the lemma holds also when $w_{t-1,1} < 0$. □



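We turn to the proof of Thm. 4 itself. By Lemma 7, if $t \ge T_0$, then $w_{t,1} < 0$ implies $w_{t+1,1} \ge 0$, and moreover, $w_{t,1} + w_{t+1,1} \ge -c/(t-1) + 5c/t \ge 3c/t$. Therefore, we have the following, where the sums below are only over $t$'s which are between $T_0$ and $T$:

\[
\begin{aligned}
\mathbb{E}[w_{T_0,1} + \ldots + w_{T,1}] &= \frac{1}{2}\mathbb{E}[2w_{T_0,1} + \ldots + 2w_{T,1}] \\
&\ge \frac{1}{2}\mathbb{E}\left[\sum_{t: w_{t,1} < 0}(w_{t,1} + w_{t+1,1}) + \sum_{t: w_{t,1} > c/t} w_{t,1}\right] \\
&\ge \frac{1}{2}\mathbb{E}\left[\sum_{t: w_{t,1} < 0}\frac{3c}{t} + \sum_{t: w_{t,1} > c/t}\frac{c}{t}\right] \\
&\ge \frac{1}{2}\mathbb{E}\left[\sum_{t: w_{t,1} \notin [0, c/t]}\frac{c}{t}\right] \\
&= \frac{1}{2}\sum_{t=T_0}^{T}\Pr\left(w_{t,1} \notin [0, c/t]\right)\frac{c}{t}. \qquad (10)
\end{aligned}
\]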



Now, we claim that the probabilities above can be lower bounded by a constant. To see this, consider each such probability for $w_{t,1}$, conditioned on the event $w_{t-1,1} \ge 0$. Using the fact that $c/t \le 1$, we have

\[
\begin{aligned}
\Pr\left(w_{t,1} \in [0, c/t] \mid w_{t-1,1} \ge 0\right) &= \Pr\left(\Pi_{[-1,1]}\left((1 - c/t)w_{t-1,1} - (c/t)Z_t\right) \in [0, c/t] \mid w_{t-1,1} \ge 0\right) \\
&= \Pr\left((1 - c/t)w_{t-1,1} - (c/t)Z_t \in [0, c/t] \mid w_{t-1,1} \ge 0\right) \\
&= \Pr\left(Z_t \in \left[(t/c - 1)w_{t-1,1} - 1,\ (t/c - 1)w_{t-1,1}\right] \mid w_{t-1,1} \ge 0\right).
\end{aligned}
\]

This probability is at most $1/4$, since it asks for $Z_t$ being constrained in an interval of size 1, whereas $Z_t$ is uniformly distributed over $[-1, 3]$, which is an interval of length 4. As a result, we get

\[ \Pr\left(w_{t,1} \notin [0, c/t] \mid w_{t-1,1} \ge 0\right) = 1 - \Pr\left(w_{t,1} \in [0, c/t] \mid w_{t-1,1} \ge 0\right) \ge \frac{3}{4}. \]
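The interval argument above can be evaluated exactly rather than by sampling. The sketch below computes the conditional probability as one minus the overlap between the length-1 interval for $Z_t$ and the support $[-1, 3]$, divided by 4. The function name `prob_outside`, the grid of $w_{t-1,1}$ values, and the choice $c = 1$, $t = 20$ are illustrative assumptions.

```python
def prob_outside(w_prev, c, t):
    """Exact Pr(w_{t,1} not in [0, c/t] | w_{t-1,1} = w_prev) for Z_t ~ U[-1, 3].

    Assumes w_prev >= 0 and c/t < 1, so that the projection onto [-1, 1]
    does not change whether the iterate lands in [0, c/t].
    """
    upper = (t / c - 1.0) * w_prev      # Z_t must lie in [upper - 1, upper]
    lower = upper - 1.0
    overlap = max(0.0, min(upper, 3.0) - max(lower, -1.0))
    return 1.0 - overlap / 4.0          # the density of U[-1, 3] is 1/4


c, t = 1.0, 20
grid = [k / 100.0 for k in range(0, 101)]          # w_prev values in [0, 1]
smallest = min(prob_outside(w, c, t) for w in grid)
print(f"min over grid of Pr(w_t not in [0, c/t]) = {smallest:.4f}")
```

The minimum is attained whenever the whole length-1 interval fits inside $[-1, 3]$; for larger $w_{t-1,1}$ the interval slides out of the support and the probability only increases.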



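From this, we can lower bound Eq. (10) as follows:

\[
\begin{aligned}
\frac{1}{2}\sum_{t=T_0}^{T}\Pr\left(w_{t,1} \notin [0, c/t]\right)\frac{c}{t}
&\ge \frac{1}{2}\sum_{t=T_0}^{T}\Pr\left(w_{t,1} \notin [0, c/t],\ w_{t-1,1} \ge 0\right)\frac{c}{t} \\
&= \frac{1}{2}\sum_{t=T_0}^{T}\Pr(w_{t-1,1} \ge 0)\Pr\left(w_{t,1} \notin [0, c/t] \mid w_{t-1,1} \ge 0\right)\frac{c}{t} \\
&\ge \frac{3}{8}\sum_{t=T_0}^{T}\frac{c}{t}\Pr(w_{t-1,1} \ge 0) \\
&= \frac{3c}{8}\,\mathbb{E}\left[\sum_{t=T_0}^{T}\frac{1}{t}\mathbf{1}_{w_{t-1,1} \ge 0}\right] \\
&\ge \frac{3c}{16}\,\mathbb{E}\left[\sum_{t=T_0}^{T-1}\frac{1}{t}\mathbf{1}_{w_{t-1,1} \ge 0} + \sum_{t=T_0+1}^{T}\frac{1}{t}\mathbf{1}_{w_{t-1,1} \ge 0}\right] \\
&= \frac{3c}{16}\,\mathbb{E}\left[\sum_{t=T_0}^{T-1}\frac{1}{t}\mathbf{1}_{w_{t-1,1} \ge 0} + \sum_{t=T_0}^{T-1}\frac{1}{t+1}\mathbf{1}_{w_{t,1} \ge 0}\right] \\
&\ge \frac{3c}{16}\,\mathbb{E}\left[\sum_{t=T_0+1}^{T-1}\frac{1}{t+1}\left(\mathbf{1}_{w_{t-1,1} \ge 0} + \mathbf{1}_{w_{t,1} \ge 0}\right)\right].
\end{aligned}
\]

Now, by Lemma 7, for any realization of $w_{T_0,1}, \ldots, w_{T,1}$, the indicators $\mathbf{1}_{w_{t,1} \ge 0}$ cannot equal 0 consecutively. Therefore, $\mathbf{1}_{w_{t-1,1} \ge 0} + \mathbf{1}_{w_{t,1} \ge 0}$ must always be at least 1. Plugging it in the equation above, we get the lower bound $(3c/16)\sum_{t=T_0+2}^{T}\frac{1}{t}$. Summing up, we have shown that

\[ \mathbb{E}[w_{T_0,1} + \ldots + w_{T,1}] \ge \frac{3c}{16}\sum_{t=T_0+2}^{T}\frac{1}{t}. \]

By the boundedness assumption on $\mathcal{W}$, we have$^2$ $w_{t,1} \ge -1$, so

\[ \mathbb{E}[w_{1,1} + \ldots + w_{T_0-1,1}] \ge -T_0. \]

Overall, we get

\[ \mathbb{E}[\bar{w}_{T,1}] = \frac{1}{T}\mathbb{E}\left[\sum_{t=1}^{T} w_{t,1}\right] \ge \frac{3c}{16T}\sum_{t=T_0+2}^{T}\frac{1}{t} - \frac{T_0}{T}. \]

Therefore,

\[ \mathbb{E}[F(\bar{\mathbf{w}}_T) - F(\mathbf{w}^*)] = \mathbb{E}[F(\bar{\mathbf{w}}_T)] \ge \mathbb{E}[F((\bar{w}_{T,1}, 0, \ldots, 0))] \ge \mathbb{E}[\bar{w}_{T,1}] \ge \frac{3c}{16T}\sum_{t=T_0+2}^{T}\frac{1}{t} - \frac{T_0}{T} \]

as required.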

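B.6. Proof of Thm. 5

Proof. Using the derivation as in the proof of Lemma 1, we can upper bound $\mathbb{E}[\|\mathbf{w}_{t+1} - \mathbf{w}^*\|^2]$ by

\[ \mathbb{E}[\|\mathbf{w}_t - \mathbf{w}^*\|^2] - 2\eta_t\mathbb{E}[\langle \mathbf{g}_t, \mathbf{w}_t - \mathbf{w}^* \rangle] + \eta_t^2 G^2. \]

Extracting the inner product and summing over $t = (1-\alpha)T + 1, \ldots, T$, we get

\[ \sum_{t=(1-\alpha)T+1}^{T}\mathbb{E}[\langle \mathbf{g}_t, \mathbf{w}_t - \mathbf{w}^* \rangle] \le \sum_{t=(1-\alpha)T+1}^{T}\frac{\eta_t G^2}{2} + \sum_{t=(1-\alpha)T+1}^{T}\left(\frac{\mathbb{E}[\|\mathbf{w}_t - \mathbf{w}^*\|^2]}{2\eta_t} - \frac{\mathbb{E}[\|\mathbf{w}_{t+1} - \mathbf{w}^*\|^2]}{2\eta_t}\right). \]

By convexity of $F$, $\sum_{t=(1-\alpha)T+1}^{T}\mathbb{E}[\langle \mathbf{g}_t, \mathbf{w}_t - \mathbf{w}^* \rangle]$ is lower bounded by

\[ \sum_{t=(1-\alpha)T+1}^{T}\mathbb{E}[F(\mathbf{w}_t) - F(\mathbf{w}^*)] \ge \alpha T\,\mathbb{E}[F(\bar{\mathbf{w}}_T^{\alpha}) - F(\mathbf{w}^*)]. \]

Substituting this lower bound into Eq. (11) and slightly rearranging the right hand side, we get that $\mathbb{E}[F(\bar{\mathbf{w}}_T^{\alpha}) - F(\mathbf{w}^*)]$ can be upper bounded by

\[ \frac{1}{2\alpha T}\left(\frac{1}{\eta_{(1-\alpha)T+1}}\mathbb{E}[\|\mathbf{w}_{(1-\alpha)T+1} - \mathbf{w}^*\|^2] + \sum_{t=(1-\alpha)T+1}^{T}\mathbb{E}[\|\mathbf{w}_t - \mathbf{w}^*\|^2]\left(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\right) + G^2\sum_{t=(1-\alpha)T+1}^{T}\eta_t\right). \]

Now, we invoke Lemma 1, which tells us that with any strongly convex $F$, even non-smooth, we have $\mathbb{E}[\|\mathbf{w}_t - \mathbf{w}^*\|^2] \le O(1/t)$. More specifically, we can upper bound the expression above by

\[ \frac{\max\{4, \frac{c}{2-1/c}\}\,G^2}{2\alpha T\lambda^2}\left(\frac{1}{((1-\alpha)T+1)\,\eta_{(1-\alpha)T+1}} + \sum_{t=(1-\alpha)T+1}^{T}\frac{1/\eta_t - 1/\eta_{t-1}}{t}\right) + \frac{G^2}{2\alpha T}\sum_{t=(1-\alpha)T+1}^{T}\eta_t. \]

In particular, since we take $\eta_t = c/\lambda t$, we get

\[ \frac{\max\{4, \frac{c}{2-1/c}\}\,G^2}{2c\alpha T\lambda}\left(1 + \sum_{t=(1-\alpha)T+1}^{T}\frac{1}{t}\right) + \frac{cG^2}{2\alpha T\lambda}\sum_{t=(1-\alpha)T+1}^{T}\frac{1}{t}. \]

It can be shown that $\sum_{t=(1-\alpha)T+1}^{T}\frac{1}{t} \le \log(1/(1-\alpha))$. Plugging it in and slightly simplifying, we get the desired bound. □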
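The effect of $\alpha$-suffix averaging can be sketched on the one-dimensional example from Appendix A. The code below compares the full average ($\alpha = 1$) with the average of the last half of the iterates ($\alpha = 0.5$), and also evaluates the harmonic tail sum used at the end of the proof. The noise model ($Z$ uniform on $[-1, 3]$), the constants $T = 2000$ and $c = 1$, and all function names are assumptions made for illustration.

```python
import math
import random


def sgd_iterates(T, c=1.0, seed=0):
    """Iterates w_1, ..., w_T of projected SGD with eta_t = c/t on [0, 1].

    Assumed model: F(w) = w^2 / 2 + w, stochastic gradient w + Z with
    Z uniform on [-1, 3].
    """
    rng = random.Random(seed)
    w, iterates = 0.0, []
    for t in range(1, T + 1):
        iterates.append(w)
        eta = c / t
        z = rng.uniform(-1.0, 3.0)
        w = min(1.0, max(0.0, (1.0 - eta) * w - eta * z))
    return iterates


def suffix_average(iterates, alpha):
    """Average of the last alpha * T iterates (alpha = 1 gives the full average)."""
    k = max(1, round(alpha * len(iterates)))
    return sum(iterates[-k:]) / k


def excess_risk(w):
    """F(w) - F(w*) for F(w) = w^2 / 2 + w, whose minimizer on [0, 1] is w* = 0."""
    return 0.5 * w * w + w


T, alpha, trials = 2000, 0.5, 200
full_avg = suffix_avg = 0.0
for seed in range(trials):
    ws = sgd_iterates(T, seed=seed)
    full_avg += excess_risk(suffix_average(ws, 1.0)) / trials
    suffix_avg += excess_risk(suffix_average(ws, alpha)) / trials

# Harmonic tail used at the end of the proof: sum_{t=(1-alpha)T+1}^{T} 1/t.
tail = sum(1.0 / t for t in range(int((1 - alpha) * T) + 1, T + 1))
print(f"full average:   {full_avg:.2e}")
print(f"suffix average: {suffix_avg:.2e}")
print(f"harmonic tail {tail:.4f} vs log(1/(1-alpha)) = {math.log(1 / (1 - alpha)):.4f}")
```

Discarding the early iterates removes the terms responsible for the $\log(T)$ factor, which is why the suffix average comes out smaller in this sketch.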



2 To analyze the case where $\mathcal{W}$ is unbounded, one can replace this by a coarse bound on how much $w_{t,1}$ can change in the first $T_0$ iterations, since the step sizes are bounded and $T_0$ is essentially a constant anyway.

B.7. Proof of Lemma 2

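To prove this lemma, we will first prove the following auxiliary result, which rewrites $\|\mathbf{w}_{t+1} - \mathbf{w}^*\|^2$ in a more explicit form.

Lemma 8. Under the conditions of Lemma 2, it holds for any $t \ge 4c^2 + 4c$ that

\[ \|\mathbf{w}_{t+1} - \mathbf{w}^*\|^2 \le \frac{2c}{\lambda}\sum_{i=2c}^{t}\frac{1}{i}\left(\prod_{j=i+1}^{t}\left(1 - \frac{2c}{j}\right)\right)\langle \mathbf{w}_i - \mathbf{w}^*, \mathbf{z}_i \rangle + \frac{3c^2G^2}{\lambda^2 t}. \]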



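Proof. By the strong convexity of $F$ and the fact that $\mathbf{w}^*$ minimizes $F$ in $\mathcal{W}$, we have

\[ \langle \mathbf{g}_t, \mathbf{w}_t - \mathbf{w}^* \rangle \ge F(\mathbf{w}_t) - F(\mathbf{w}^*) + \frac{\lambda}{2}\|\mathbf{w}_t - \mathbf{w}^*\|^2, \]

as well as

\[ F(\mathbf{w}_t) - F(\mathbf{w}^*) \ge \frac{\lambda}{2}\|\mathbf{w}_t - \mathbf{w}^*\|^2. \]

Also, by convexity of $\mathcal{W}$, for any point $\mathbf{v}$ and any $\mathbf{w} \in \mathcal{W}$ we have $\|\Pi_{\mathcal{W}}(\mathbf{v}) - \mathbf{w}\| \le \|\mathbf{v} - \mathbf{w}\|$. Using these inequalities, we have the following:

\[
\begin{aligned}
\|\mathbf{w}_{t+1} - \mathbf{w}^*\|^2 &= \|\Pi_{\mathcal{W}}(\mathbf{w}_t - \eta_t\hat{\mathbf{g}}_t) - \mathbf{w}^*\|^2 \\
&\le \|\mathbf{w}_t - \eta_t\hat{\mathbf{g}}_t - \mathbf{w}^*\|^2 \\
&= \|\mathbf{w}_t - \mathbf{w}^*\|^2 - 2\eta_t\langle \hat{\mathbf{g}}_t, \mathbf{w}_t - \mathbf{w}^* \rangle + \eta_t^2\|\hat{\mathbf{g}}_t\|^2 \\
&= \|\mathbf{w}_t - \mathbf{w}^*\|^2 - 2\eta_t\langle \mathbf{g}_t, \mathbf{w}_t - \mathbf{w}^* \rangle + 2\eta_t\langle \mathbf{z}_t, \mathbf{w}_t - \mathbf{w}^* \rangle + \eta_t^2\|\hat{\mathbf{g}}_t\|^2 \\
&\le \|\mathbf{w}_t - \mathbf{w}^*\|^2 - 2\eta_t\left(F(\mathbf{w}_t) - F(\mathbf{w}^*) + \frac{\lambda}{2}\|\mathbf{w}_t - \mathbf{w}^*\|^2\right) + 2\eta_t\langle \mathbf{z}_t, \mathbf{w}_t - \mathbf{w}^* \rangle + \eta_t^2 G^2 \\
&\le \|\mathbf{w}_t - \mathbf{w}^*\|^2 - 2\eta_t\left(\frac{\lambda}{2}\|\mathbf{w}_t - \mathbf{w}^*\|^2 + \frac{\lambda}{2}\|\mathbf{w}_t - \mathbf{w}^*\|^2\right) + 2\eta_t\langle \mathbf{z}_t, \mathbf{w}_t - \mathbf{w}^* \rangle + \eta_t^2 G^2 \\
&= (1 - 2\eta_t\lambda)\|\mathbf{w}_t - \mathbf{w}^*\|^2 + 2\eta_t\langle \mathbf{z}_t, \mathbf{w}_t - \mathbf{w}^* \rangle + \eta_t^2 G^2.
\end{aligned}
\]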
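Plugging our choice of $\eta_t = c/\lambda t$ into the last inequality, we get

\[ \|\mathbf{w}_{t+1} - \mathbf{w}^*\|^2 \le \left(1 - \frac{2c}{t}\right)\|\mathbf{w}_t - \mathbf{w}^*\|^2 + \frac{2c}{\lambda t}\langle \mathbf{z}_t, \mathbf{w}_t - \mathbf{w}^* \rangle + \left(\frac{cG}{\lambda t}\right)^2. \]

Unwinding this recursive inequality till $t = 2c$, we get that for any $t \ge 2c$,

\[ \|\mathbf{w}_{t+1} - \mathbf{w}^*\|^2 \le \frac{2c}{\lambda}\sum_{i=2c}^{t}\frac{1}{i}\left(\prod_{j=i+1}^{t}\left(1 - \frac{2c}{j}\right)\right)\langle \mathbf{z}_i, \mathbf{w}_i - \mathbf{w}^* \rangle + \frac{c^2G^2}{\lambda^2}\sum_{i=2c}^{t}\frac{1}{i^2}\prod_{j=i+1}^{t}\left(1 - \frac{2c}{j}\right). \]

We now note that for any $a \ge 2c$,

\[ \prod_{j=a}^{t}\left(1 - \frac{2c}{j}\right) = \prod_{j=a}^{t}\frac{j - 2c}{j} = \frac{\prod_{j=a-2c}^{t-2c} j}{\prod_{j=a}^{t} j} = \frac{\prod_{j=a-2c}^{a-1} j}{\prod_{j=t-2c+1}^{t} j} \le \left(\frac{a-1}{t-2c+1}\right)^{2c} \le \left(\frac{a-1}{t-2c}\right)^{2c}. \]

Plugging this back and slightly simplifying the upper bound, we get

\[ \|\mathbf{w}_{t+1} - \mathbf{w}^*\|^2 \le \frac{2c}{\lambda}\sum_{i=2c}^{t}\frac{1}{i}\left(\prod_{j=i+1}^{t}\left(1 - \frac{2c}{j}\right)\right)\langle \mathbf{z}_i, \mathbf{w}_i - \mathbf{w}^* \rangle + \frac{c^2G^2}{\lambda^2}\sum_{i=2c}^{t}\frac{1}{i^2}\left(\frac{i}{t-2c}\right)^{2c}. \]

We also note that if $2c \ge 2$ (which holds assuming $c > 1/2$ and that $2c$ is a whole number), then

\[ \sum_{i=2c}^{t}\frac{1}{i^2}\left(\frac{i}{t-2c}\right)^{2c} = (t-2c)^{-2c}\sum_{i=2c}^{t} i^{2c-2} \le (t-2c)^{-2c}\int_{0}^{t+1} i^{2c-2}\,di = \frac{(t+1)^{2c-1}}{(2c-1)(t-2c)^{2c}}. \]

For any $t \ge 4c^2 + 4c$, this is at most

\[ \frac{1}{(2c-1)t}\left(1 + \frac{1}{2c}\right)^{2c} \le \frac{3}{(2c-1)t} \le \frac{3}{t}. \]

Therefore, we get that for any $t \ge 4c^2 + 4c$,

\[ \|\mathbf{w}_{t+1} - \mathbf{w}^*\|^2 \le \frac{2c}{\lambda}\sum_{i=2c}^{t}\frac{1}{i}\left(\prod_{j=i+1}^{t}\left(1 - \frac{2c}{j}\right)\right)\langle \mathbf{z}_i, \mathbf{w}_i - \mathbf{w}^* \rangle + \frac{3c^2G^2}{\lambda^2 t}. \]

□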



With this result at hand, we are now in a position to prove our high probability bound. Denote $X_i = \|\mathbf{w}_i - \mathbf{w}^*\|^2$ and $Z_i = \langle \mathbf{w}_i - \mathbf{w}^*, \mathbf{z}_i \rangle$. We have that the conditional expectation of $Z_i$, given previous rounds, is $\mathbb{E}_{i-1}[Z_i] = 0$, and the conditional variance is $\mathrm{Var}_{i-1}(Z_i) \le X_i \cdot G^2$. By Lemma 8,



i=2c 



where 



with $\beta_i = (i+1-2c) \times \ldots \times (i-1)$ and $\gamma_t = ((t+1-2c) \times \ldots \times t)^{-1}$. Let us now study the sum

\[ \sum_{i=2c}^{t}\beta_i Z_i \]

of the martingale differences $d_i = \beta_i Z_i$. Observe that the sum of conditional variances satisfies

\[ \sigma_t^2 = \sum_{i=2c}^{t}\mathrm{Var}_{i-1}(\beta_i Z_i) \le \sum_{i=2c}^{t}\beta_i^2 G^2 X_i \]

and we also have the uniform bound

\[ |\beta_i Z_i| \le G\beta_i \le G\beta_t = G(t+1-2c) \times \ldots \times (t-1) := b_t. \]



We now apply Lemma 6 to the sum of martingale differences $\sum_{i=2c}^{t}\beta_i Z_i$. We get that with probability at least $1 - \delta$,

\[ \sum_{i=2c}^{t}\beta_i Z_i \le 2\max\left\{2\sigma_t, b_t\sqrt{\ln(\ln(T)/\delta)}\right\}\sqrt{\ln(\ln(T)/\delta)} \quad \text{for all } t \le T. \]



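Recall that $\sigma_t = G\sqrt{\sum_{i=2c}^{t}\beta_i^2 X_i}$. Define a shorthand $B = \sqrt{\ln(\ln(T)/\delta)}$. Multiplying both sides by $\gamma_t$, we have for all $t \le T$,

\[ \sum_{i=2c}^{t}\frac{1}{i}\left(\prod_{j=i+1}^{t}\left(1 - \frac{2c}{j}\right)\right)Z_i = \gamma_t\sum_{i=2c}^{t}\beta_i Z_i \le \max\left\{4\gamma_t BG\sqrt{\sum_{i=2c}^{t}\beta_i^2 X_i},\ 2\gamma_t b_t B^2\right\}. \]

Observe that $\gamma_t b_t \le G t^{-1}$. Using Eq. (11), we have

\[
\begin{aligned}
X_{t+1} &\le \frac{2cG}{\lambda}\max\left\{4\gamma_t B\sqrt{\sum_{i=2c}^{t}\beta_i^2 X_i},\ \frac{2B^2}{t}\right\} + \frac{3c^2G^2}{\lambda^2 t} \\
&\le \frac{8cBG}{\lambda}\sqrt{\sum_{i=2c}^{t}(\beta_i\gamma_t)^2 X_i} + \frac{4cB^2G}{\lambda t} + \frac{3c^2G^2}{\lambda^2 t}.
\end{aligned}
\]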

We also have

\[ (\beta_i\gamma_t)^2 = \left(\frac{(i+1-2c)\cdots(i-1)}{(t+1-2c)\cdots(t-1)\,t}\right)^2 \le \frac{i^{4c-2}}{t^{4c}}. \]



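Assume by way of induction that $X_i \le A/i$ for all $i \le t$, for some constant $A$ to be defined later. Now, let us prove the result for $X_{t+1}$. We have

\[
\begin{aligned}
X_{t+1} &\le \frac{8cBG}{\lambda}\sqrt{\sum_{i=2c}^{t}(\beta_i\gamma_t)^2\frac{A}{i}} + \frac{4cB^2G}{\lambda t} + \frac{3c^2G^2}{\lambda^2 t} \\
&\le \frac{8cBG}{\lambda}\,t^{-2c}\sqrt{A\sum_{i=2c}^{t} i^{4c-3}} + \frac{4cB^2G}{\lambda t} + \frac{3c^2G^2}{\lambda^2 t} \\
&\le \frac{8cBG}{\lambda}\sqrt{t^{-2}\left(1 + \frac{1}{t}\right)^{4c-2}\frac{A}{4c-2}} + \frac{4cB^2G}{\lambda t} + \frac{3c^2G^2}{\lambda^2 t} \\
&\le \frac{22cBG}{\lambda t}\sqrt{\frac{A}{4c-2}} + \frac{4cB^2G}{\lambda t} + \frac{3c^2G^2}{\lambda^2 t},
\end{aligned}
\]

where in the last step we used the fact that since $t \ge 2c$, then $(1 + 1/t)^{2c-1} \le (1 + 1/2c)^{2c} \le e$. We would like the last quantity to be less than $A/(t+1)$ in order to prove the induction step. Since $\frac{t+1}{t} \le 2$, it is enough to find $A$ that satisfies

\[ \frac{22cBG}{\lambda}\sqrt{\frac{A}{4c-2}} + \frac{4cB^2G}{\lambda} + \frac{3c^2G^2}{\lambda^2} \le \frac{A}{2}. \]

This can be written as a quadratic inequality of the form $\frac{A}{2} - \sqrt{A}\,r_1 - r_2 \ge 0$, with a feasible solution for $A$ being any upper bound on $\left(r_1 + \sqrt{r_1^2 + 2r_2}\right)^2 \le 4r_1^2 + 4r_2$. This gives

\[ A \ge 4\left(\frac{(22cBG)^2}{\lambda^2(4c-2)} + \frac{4cB^2G}{\lambda} + \frac{3c^2G^2}{\lambda^2}\right). \]

Upper bounding $c/(4c-2)$ by $1/2$ (since $2c$ is a whole number and $c > 1/2$), and slightly simplifying, we get the required bound.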



